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Abstract 


The pace of research on peer-to-peer (P2P) networking in the last 
five years warrants a critical survey. P2P has the makings of a 
disruptive technology -- it can aggregate enormous storage and 
processing resources while minimizing entry and scaling costs. 


Failures are common amongst massive numbers of distributed peers, 
though the impact of individual failures may be less than in 


conventional architectures. Thus, the key to realizing P2P’s 
potential in applications other than casual file sharing is 
robustness. 


P2P search methods are first couched within an overall P2P taxonomy. 
P2P indexes for simple key lookup are assessed, including those based 
on Plaxton trees, rings, tori, butterflies, de Bruijn graphs, and 
skip graphs. Similarly, P2P indexes for keyword lookup, information 
retrieval and data management are explored. Finally, early efforts 
to optimize range, multi-attribute, join, and aggregation queries 


over P2P indexes are reviewed. Insofar as they are available in the 
primary literature, robustness mechanisms and metrics are highlighted 
throughout. However, the low-level mechanisms that most affect 


robustness are not well isolated in the literature. Recommendations 
are given for future research. 
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1. Introduction 


Peer-to-peer (P2P) networks are those that exhibit three 
characteristics: self-organization, symmetric communication, and 
distributed control [1]. A self-organizing P2P network 
"automatically adapts to the arrival, departure and failure of nodes" 
[2]. Communication is symmetric in that peers act as both clients 
and servers. It has no centralized directory or control point. 
USENET servers and BGP peers have these traits [3] but the emphasis 
here is on the flurry of research since 2000. Leading examples 
include Gnutella [4], Freenet [5], Pastry [2], Tapestry [6], Chord 
[7], the Content Addressable Network (CAN) [8], pSearch [9], and 


Edutella [10]. Some have suggested that peers are inherently 
unreliable [11]. Others have assumed well-connected, stable peers 
[12]. 


This critical survey of P2P academic literature is warranted, given 
the intensity of recent research. At the time of writing, one 
research database lists over 5,800 P2P publications [13]. One vendor 
surveyed P2P products and deployments [14]. There is also a tutorial 
survey of leading P2P systems [15]. DePaoli and Mariani recently 
reviewed the dependability of some early P2P systems at a high level 
[16]. The need for a critical survey was flagged in the peer-to-peer 
research group of the Internet Research Task Force (IRTF) [17]. 


P2P is potentially a disruptive technology with numerous 
applications, but this potential will not be realized unless it is 
demonstrated to be robust. A massively distributed search technique 
may yield numerous practical benefits for applications [18]. A P2P 
system has potential to be more dependable than architectures relying 
on a small number of centralized servers. It has potential to evolve 
better from small configurations -- the capital outlays for high 
performance servers can be reduced and spread over time if a P2P 
assembly of general purpose nodes is used. A similar argument 


motivated the deployment of distributed databases -- one thousand, 
off-the-shelf PC processors are more powerful and much less expensive 
than a large mainframe computer [19]. Storage and processing can be 


aggregated to achieve massive scale. Wasteful partitioning between 
servers or clusters can be avoided. As Gedik and Liu put it, if P2P 
is to find its way into applications other than casual file sharing, 
then reliability needs to be addressed [20]. 


Risson & Moors Informational [Page 3] 


RFC 4981 Survey of Research on P2P Search September 2007 


The taxonomy of Figure 1 divides the entire body of P2P research 
literature along four lines: search, storage, security, and 
applications. This survey concentrates on search aspects. A P2P 
search network consists of an underlying index (Sections 2 to 4) and 
queries that propagate over that index (Section 5). 


Search [18, 21-29] 

Semantic-Free Indexes [2, 6, 7, 30-52] 
Plaxton Trees 
Rings 
Tori 
Butterflies 
de Bruijn Graphs 
Skip Graphs 

Semantic Indexes [4, 53-71] 
Keyword Lookup 
Peer Information Retrieval 
Peer Data Management 

Queries [20, 22, 23, 25, 32, 38, 41, 56, 72-100] 
Range Queries 
Multi-Attribute Queries 
Join Queries 
Aggregation Queries 
Continuous Queries 
Recursive Queries 
Adaptive Queries 


Storage 

Consistency & Replication [101-112] 
Eventual consistency 
Trade-offs 

Distribution [39, 42, 90, 92, 113-131] 
Epidemics, Bloom Filters 

Fault Tolerance [40, 105, 132-139] 
Erasure Coding 
Byzantine Agreement 

Locality [24, 43, 47, 140-160] 

Load Balancing [37, 86, 100, 107, 151, 161-171] 
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Security 

Character [172-182] 
Identity 
Reputation and Trust 
Incentives 

Goals [25, 27, 71, 183-197] 
Availability 
Authenticity 
Anonymity 
Access Control 
Fair Trading 


Applications [1, 198-200] 
Memory [32, 90, 142, 201-222] 
File Systems 
Web 
Content Delivery Networks 
Directories 
Service Discovery 
Publish / Subscribe 
Intelligence [223-228] 


GRID 

Security... 
Communication [12, 92, 119, 229-247] 

Multicasting 

Streaming Media 

Mobility 

Sensors... 

Figure 1: Classification of P2P Research Literature 

This survey is concerned with two questions. The first, "How do P2P 
search networks work?" This foundation is important given the pace 
and breadth of P2P research in the last five years. In Section 2, we 


classify indexes as local, centralized and distributed. Since 
distributed indexes are becoming dominant, they are given closer 
attention in Sections 3 and 4. Section 3 compares distributed P2P 
indexes for simple key lookup; in particular, their origins (Section 
3.1), dependability (Section 3.2), latency (Section 3.3), and their 
support for multicast (Section 3.4). It classifies those indexes 
according to their routing geometry (Section 3.5) -- Plaxton trees, 
rings, tori, butterflies, de Bruijn graphs and skip graphs. Section 
4 reviews distributed P2P indexes supporting keyword lookup (Section 
4.1) and information retrieval (Section 4.2). Section 5 probes the 
embryonic research on P2P queries; in particular, range queries 
(Section 5.1), multi-attribute queries (Section 5.2), join queries 
(Section 5.3), and aggregation queries (Section 5.4). 
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The second question, "How robust are P2P search networks?" Insofar 
as it is available in the research literature, we tease out the 
robustness mechanisms and metrics throughout Sections 2 to 5. 
Unfortunately, robustness is often more sensitive to low-level design 
choices than it is to the broad P2P index structure, yet these 
underlying design choices are seldom isolated in the primary 
literature [248]. Furthermore, there has been little consensus on 
P2P robustness metrics (Section 3.2). Section 8 gives 
recommendations to address these important gaps. 


1.1. Related Disciplines 


Peer-to-peer research draws upon numerous distributed systems 
disciplines. Networking researchers will recognize familiar issues 
of naming, routing, and congestion control. P2P designs need to 
address routing and security issues across network region boundaries 
[152]. Networking research has traditionally been host-centric. The 
Web’s Universal Resource Identifiers are naturally tied to specific 
hosts, making object mobility a challenge [216]. 


P2P work is data-centric [249]. P2P systems for dynamic object 
location and routing have borrowed heavily from the distributed 
systems corpus. Some have used replication, erasure codes, and 
Byzantine agreement [111]. Others have used epidemics for durable 
peer group communication [39]. 


Similarly, P2P research is set to benefit from database research 
[250]. Database researchers will recognize the need to reapply 
Codd’s principle of physical data independence, that is, to decouple 
data indexes from the applications that use the data [23]. It was 
the invention of appropriate indexing mechanisms and query 
optimizations that enabled data independence. Database indexes like 
B+ trees have an analog in P2P’s distributed hash tables (DHTs). 
Wide-area, P2P query optimization is a ripe, but challenging, area 
for innovation. 


More flexible distribution of objects comes with increased security 
risks. There are opportunities for security researchers to deliver 
new methods for availability, file authenticity, anonymity, and 
access control [25]. Proactive and reactive mechanisms are needed to 
deal with large numbers of autonomous, distributed peers. To build 
robust systems from cooperating but self-interested peers, issues of 
identity, reputation, trust, and incentives need to be tackled. 
Although it is beyond the scope of this paper, robustness against 
malicious attacks also ought to be addressed [195]. 


Possibly the largest portion of P2P research has majored on basic 
routing structures [18], where research on algorithms comes to the 
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fore. Should the overlay be "Structured" or "unstructured"? Are the 
two approaches competing or complementary? Comparisons of the 
"structured" approaches (hypercubes, rings, toroids, butterflies, de 
Bruijn, and skip graphs) have weighed the amount of routing state per 
peer and the number of links per peer against overlay hop counts. 
While "unstructured" overlays initially used blind flooding and 
random walks, overheads usually trigger some structure, for example, 
super-peers and clusters. 


P2P applications rely on cooperation between these disciplines. 
Applications have included file sharing, directories, content 
delivery networks, email, distributed computation, publish-subscribe 
middleware, multicasting, and distributed authentication. Which 
applications will be suited to which structures? Are there adaptable 
mechanisms that can decouple applications from the underlying data 
structures? What are the criteria for selection of applications 
amenable to a P2P design [1]? 


Robustness is emphasized throughout the survey. We are particularly 
interested in two aspects. The first, dependability, was a leading 
design goal for the original Internet [251]. It deserves the same 
status in P2P. The measures of dependability are well established: 
reliability, a measure of the mean-time-to-failure (MTTF); 
availability, a measure of both the MTTF and the mean-time-to-repair 
(MTTR); maintainability; and safety [252]. The second aspect is the 
ability to accommodate variation in outcome, which one could call 
adaptability. Its measures have yet to be defined. In the context 
of the Internet, it was only recently acknowledged as a first-class 
requirement [253]. In P2P, it means planning for the tussles over 
resources and identity. It means handling different kinds of queries 
and accommodating changeable application requirements with minimal 
intervention. It means "organic scaling" [22], whereby the system 
grows gracefully, without a priori data center costs or architectural 
breakpoints. 


In the following section, we discuss one notable omission from the 
taxonomy of P2P networking in Figure 1 -- routing. 


1.2. Structured and Unstructured Routing 


P2P routing algorithms have been classified as "Structured" or 
"unstructured". Peers in unstructured overlay networks join by 
connecting to any existing peers [254]. In structured overlays, the 
identifier of the joining peer determines the set of peers that it 
connects to [254]. Early instantiations of Gnutella were 
unstructured -- keyword queries were flooded widely [255]. Napster 
[256] had decentralized content and a centralized index, so it only 
partially satisfies the distributed control criteria for P2P systems. 
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Early structured algorithms included Plaxton, Rajaraman and Richa 
(PRR) [30], Pastry [2], Tapestry [31], Chord [7], and the Content 
Addressable Network [8]. Mishchke and Stiller recently classified 
P2P systems by the presence or absence of structure in routing tables 
and network topology [257]. 


Some have cast unstructured and structured algorithms as competing 
alternatives. Unstructured approaches have been called "first 
generation", implicitly inferior to the "second generation" 
structured algorithms [2, 31]. When generic key lookups are 
required, these structured, key-based routing schemes can guarantee 
location of a target within a bounded number of hops [23]. The 
broadcasting unstructured approaches, however, may have large routing 
costs, or fail to find available content [22]. Despite the apparent 
advantages of structured P2P, several research groups are still 
pursuing unstructured P2P. 


There have been two main criticisms of structured systems [61]. The 
first relates to peer transience, which in turn, affects robustness. 
Chawathe, et al. opined that highly transient peers are not well 
supported by DHTs [61]. P2P systems often exhibit "churn", with 
peers continually arriving and departing. One objection to concerns 
about highly transient peers is that many applications use peers in 
well-connected parts of the network. The Tapestry authors analyzed 
the impact of churn in a network of 1000 nodes [31]. Others opined 
that it is possible to maintain a robust DHT at relatively low cost 
[258]. Very few papers have quantitatively compared the resilience 
of structured systems. Loguinov, Kumar, et al. claimed that there 
were only two such works [24, 36]. 


The second criticism of structured systems is that they do not 
support keyword searches and complex queries as well as unstructured 
systems. Given the current file-sharing deployments, keyword 
searches seem more important than exact-match key searches in the 
short term. Paraphrased, "most queries are for hay, not needles" 
[61]. 


More recently, some have justifiably seen unstructured and structured 
proposals as complementary, and have devised hybrid models [259]. 
Their starting point was the observation that unstructured flooding 
or random walks are inefficient for data that is not highly 
replicated across the P2P network. Structured graphs can find keys 
efficiently, irrespective of replication. Castro, et al. proposed 
Structella, a hybrid of Gnutella built on top of Pastry [259]. 
Another design used structured search for rare items and unstructured 
search for massively replicated items [54]. 
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However, the "structured versus unstructured routing" taxonomy is 
becoming less useful, for two reasons, Firstly, most "unstructured" 
proposals have evolved and incorporated structure. Consider the 
classic "unstructured" system, Gnutella [4]. For scalability, its 
peers are either ultrapeers or leaf nodes. This hierarchy is 
augmented with a query routing protocol whereby ultrapeers receive a 
hashed summary of the resource names available at leaf nodes. 
Between ultrapeers, simple query broadcast is still used, though 
methods to reduce the query load here have been considered [260]. 
Secondly, there are emerging schema-based P2P designs [59], with 
super-node hierarchies and structure within documents. These are 
quite distinct from the structured DHT proposals. 


1.3. Indexes and Queries 


Given that most, if not all, P2P designs today assume some structure, 
a more instructive taxonomy would describe the structure. In this 
survey, we use a database taxonomy in lieu of the networking 
taxonomy, as suggested by Hellerstein, Cooper, and Garcia-Molina [23, 
261]. The structure is determined by the type of index (Sections 2 , 
3, and 4). Queries feature in lieu of routing (Section 5). The DHT 
algorithms implement a "semantic-free index" [216]. They are 
oblivious of whether keys represent document titles, meta-data, or 
text. Gnutella-like and schema-based proposals have a "semantic 
index". 


Index engineering is at the heart of P2P search methods. It captures 
a broad range of P2P issues, as demonstrated by the Search/Index 
Links model [261]. As Manber put it, "the most important of the 
tools for information retrieval is the index -- a collection of terms 
with pointers to places where information about documents can be 
found" [262]. Sen and Wang noted that a "P2P network" usually 
consists of connections between hosts for application-layer 
Signaling, rather than for the data transfer itself [263]. 

Similarly, we concentrate on the "signaled" indexes and queries. 


Our focus here is the dependability and adaptability of the search 
network. Static dependability is a measure of how well queries route 
around failures in a network that is normally fault-free. Dynamic 
dependability gives an indication of query success when nodes and 
data are continually joining and leaving the P2P system. An 
adaptable index accommodates change in the data and query 
distribution. It enables data independence, in that it facilitates 
changes to the data layout without requiring changes to the 
applications that use the data [23]. An adaptable P2P system can 
support rich queries for a wide range of applications. Some 
applications benefit from simple, semantic-free key lookups [264]. 
Others require more complex, Structured Query Language (SQL)-like 
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queries to find documents with multiple keywords, or to aggregate or 
join query results from distributed relations [22]. 


2. Index Types 


A P2P index can be local, centralized, or distributed. With a local 
index, a peer only keeps the references to its own data, and does not 
receive references for data at other nodes. The very early Gnutella 
design epitomized the local index (Section 2.1). In a centralized 
index, a single server keeps references to data on many peers. The 
classic example is Napster (Section 2.2). With distributed indexes, 
pointers towards the target reside at several nodes. One very early 
example is Freenet (Section 2.3). Distributed indexes are used in 
most P2P designs nowadays -- they dominate this survey. 


P2P indexes can also be classified as non-forwarding and forwarding. 
When queries are guided by a non-forwarding index, they jump to the 
node containing the target data in a single hop. There have been 


semantic and semantic-free one-hop schemes [138, 265, 266]. Where 
scalability to a massive number of peers is required, these schemes 
have been extended to two hops [267, 268]. More common are the 


forwarding P2Ps, where the number of hops varies with the total 
number of peers, often logarithmically. The related trade-offs 
between routing state, lookup latency, update bandwidth, and peer 
churn are critical to total system dependability. 


2.1. Local Index (Gnutella) 


P2Ps with a purely local data index are becoming rare. In such 
designs, peers flood queries widely and only index their own content. 
They enable rich queries - the search is not limited to a simple key 
lookup. However, they also generate a large volume of query traffic 
with no guarantee that a match will be found, even if it does exist 
on the network. For example, to find potential peers on the early 
instantiations of Gnutella, 'ping' messages were broadcast over the 
P2P network and the ’pong’ responses were used to build the node 
index. Then, small 'query” messages, each with a list of keywords, 
are broadcast to peers that respond with matching filenames [4]. 


There have been numerous attempts to improve the scalability of 
local-index P2P networks. Gnutella uses fixed time-to-live (TTL) 
rings, where the query's TTL is set less than 7-10 hops [4]. Small 
TTLs reduce the network traffic and the load on peers, but also 
reduce the chances of a successful query hit. One paper reported, 
perhaps a little too bluntly, that the fixed "TTL-based mechanism 
does not work" [67]. To address this TTL selection problem, they 
proposed an expanding ring, known elsewhere as iterative deepening 
[29]. It uses successively larger TTL counters until there is a 
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match. The flooding, ring, and expanding ring methods all increase 
network load with duplicated query messages. A random walk, whereby 
an unduplicated query wanders about the network, does indeed reduce 
the network load but massively increases the search latency. One 
solution is to replicate the query k times at each peer. Called 
random k-walkers, this technique can be coupled with TTL limits, or 
periodic checks with the query originator, to cap the query load 
[67]. Adamic, Lukose, et al. suggested that the random walk searches 
be directed to nodes with a higher degree, that is, with larger 
numbers of inter-peer connections [269]. They assumed that higher- 
degree peers are also capable of higher query throughputs. However, 
without some balancing design rule, such peers would be swamped with 
the entire P2P signaling traffic. In addition to the above 
approaches, there is the 'directed breadth-first’ algorithm [29]. It 
forwards queries within a subset of peers selected according to 
heuristics on previous performance, like the number of successful 
query results. Another algorithm, called probabilistic flooding, has 
been modeled using percolation theory [270]. 


Several measurement studies have investigated locally indexed P2Ps. 
Jovanovic noted Gnutella’s power law behaviour [70]. Sen and Wang 
compared the performance of Gnutella, Fasttrack [271], and Direct 
Connect [263, 272, 273]. At the time, only Gnutella used local data 
indexes. All three schemes now use distributed data indexes, with 
hierarchy in the form of Ultrapeers (Gnutella), Super-Nodes 
FastTrack), and Hubs (Direct Connect). It was found that a very 
small percentage of peers have a very high degree and that the total 
system dependability is at the mercy of such peers. While peer up- 
time and bandwidth were heavy-tailed, they did not fit well with the 
Zipf distribution. Fortunately for Internet Service Providers, 
measures aggregated by IP prefix and Autonomous System (AS) were more 
stable than for individual IP addresses. A study of University of 
Washington traffic found that Gnutella and Kazaa together contributed 
43% of the university’s total TCP traffic [274]. They also reported 
a heavy-tailed distribution, with 600 external peers (out of 281,026) 
delivering 26% of Kazaa bytes to internal peers. Furthermore, 
objects retrieved from the P2P network were typically three orders of 
magnitude larger than Web objects -- 300 objects contributed to 
almost half the total outbound Kazaa bandwidth. Others reported 
Gnutella’s topology mismatch, whereby only 2-5% of P2P connections 
link peers in the same Autonomous System (AS), despite over 40% of 
peers being in the top 10 ASs [65]. Together these studies 
underscore the significance of multimedia sharing applications. They 
motivate interesting caching and locality solutions to the topology 
mismatch problem. 


These same studies bear out one main dependability lesson: total 
system dependability may be sensitive to the dependability of high- 
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degree peers. The designers of Scamp translated this observation to 
the design heuristic, "have the degree of each node be of nearly 
equal size" [153]. They analyzed a system of N peers, with mean 
degree c.log(n), where link failures occur independently with 
probability e. If d>0 is fixed and c>(1+d)/(-log(e)), then the 
probability of graph disconnection goes to zero as N->infinity. 
Otherwise, if c<(1-d)/(-log(e)), then the probability of 
disconnection goes to one as N->infinity. They presented a 
localizer, which finds approximate minima to a global function of 
peer degree and arbitrary link costs using only local information. 
The Scamp overlay construction algorithms could support any of the 
flooding and walking routing schemes above, or other epidemic and 
multicasting schemes for that matter. Resilience to high churn rates 
was identified for future study. 


2.2. Central Index (Napster) 


Centralized schemes like Napster [256] are significant because they 
were the first to demonstrate the P2P scalability that comes from 
separating the data index from the data itself. Ultimately, 36 
million Napster users lost their service not because of technical 
failure, but because the single administration was vulnerable to the 
legal challenges of record companies [275]. 


There has since been little research on P2P systems with central data 
indexes. Such systems have also been called 'hybrid” since the index 
is centralized but the data is distributed. Yang and Garcia-Molina 
devised a four-way classification of hybrid systems [276]: unchained 
servers, where users whose index is on one server do not see other 
servers’ indexes; chained servers, where the server that receives a 
query forwards it to a list of servers if it does not own the index 
itself; full replication, where all centralized servers keep a 
complete index of all available metadata; and hashing, where keywords 
are hashed to the server where the associated inverted list is kept. 
The unchained architecture was used by Napster, but it has the 
disadvantage that users do not see all indexed data in the system. 
Strictly speaking, the other three options illustrate the distributed 
data index, not the central index. The chained architecture was 
recommended as the optimum for the music-swapping application at the 
time. The methods by which clients update the central index were 
classified as batch or incremental, with the optimum determined by 
the query-to-login ratio. Measurements were derived from a clone of 
Napster called OpenNap[277]. Another study of live Napster data 
reported wide variation in the availability of peers, a general 
unwillingness to share files (20-40% of peers share few or no files), 
and a common understatement of available bandwidth so as to 
discourage other peers from sharing one’s link [202]. 
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Influenced by Napster’s early demise, the P2P research community may 
have prematurely turned its back on centralized architectures. 
Chawathe, Ratnasamy, et al. opined that Google and Yahoo demonstrate 
the viability of a centralized index. They argued that "the real 
barriers to Napster-like designs are not technical but legal and 
financial" [61]. Even this view may be a little too harsh on the 
centralized architectures -- it implies that they always have an up- 
front capital hurdle that is steeper than for distributed 
architectures. The closer one looks at scalable ’centralized’ 
architectures, the less the distinction with “distributed” 
architectures seems to matter. For example, it is clear that 
Google’s designers consider Google a distributed, not centralized, 
file system [278]. Google demonstrates the scale and performance 
possible on commodity hardware, but still has a centralized master 
that is critical to the operation of each Google cluster. Time may 
prove that the value of emerging P2P networks, regardless of the 
centralized-versus-distributed classification, is that they smooth 
the capital outlays and remove the single points of failure across 
the spectra of scale and geographic distribution. 


2.3. Distributed Index (Freenet) 


An important early P2P proposal for a distributed index was Freenet 
[5, 71, 279]. While its primary emphasis was the anonymity of peers, 
it did introduce a novel indexing scheme. Files are identified by 
low-level "content-hash" keys and by "secure signed-subspace" keys, 
which ensure that only a file owner can write to a file while anyone 
can read from it. To find a file, the requesting peer first checks 
its local table for the node with keys closest to the target. When 
that node receives the query, it too checks for either a match or 
another node with keys close to the target. Eventually, the query 
either finds the target or exceeds time-to-live (TTL) limits. The 
query response traverses the successful query path in reverse, 
depositing a new routing table entry (the requested key and the data 
holder) at each peer. The insert message similarly steps towards the 
target node, updating routing table entries as it goes, and finally 
stores the file there. Whereas early versions of Gnutella used 
breadth-first flooding, Freenet uses a more economic depth-first 
search [280]. 


An initial assessment has been done of Freenet’s robustness. It was 
shown that in a network of 1000 nodes, the median query path length 
stayed under 20 hops for a failure of 30% of nodes. While the 
Freenet designers considered this as evidence that the system is 
"surprisingly robust against quite large failures" [71], the same 
datapoint may well be outside meaningful operating bounds. How many 
applications are useful when the first quartile of queries have path 
lengths of several hundred hops in a network of only 1000 nodes, per 
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Figure 4 of [71]? To date, there has been no analysis of Freenet’s 
dynamic robustness. For example, how does it perform when nodes are 
continually arriving and departing? 


There have been both criticisms and extensions of the early Freenet 
work. Gnutella proponents acknowledged the merit in Freenet’s 
avoidance of query broadcasting [281]. However, they are critical on 
two counts: the exact file name is needed to construct a query; and 
exactly one match is returned for each query. P2P designs using 


DHTs, per Section 3, share similar characteristics -- a precise query 
yields a precise response. The similarity is not surprising since 
Freenet also uses a hash function to generate keys. However, the 


query routing used in the DHTs has firmer theoretical foundations. 
Another difference with DHTs is that Freenet will take time, when a 
new node joins the network, to build an index that facilitates 
efficient query routing. By the inventor’s own admission, this is 
damaging for a user’s first impressions [282]. It was proposed to 
download a copy of routing tables from seed nodes at startup, even 
though the new node might be far from the seed node. Freenet’s slow 
startup motivated Mache, Gilbert, et al. to amend the overlay after 
failed requests and to place additional index entries on successful 
requests -- they claim almost an order of magnitude reduction in 
average query path length [280]. Clarke also highlighted the lack of 
locality or bandwidth information available for efficient query 
routing decisions [282]. He proposed that each node gather response 
times, connection times, and proportion of successful requests for 
each entry in the query routing table. When searching for a key that 
is not in its own routing table, it was proposed to estimate response 
times from the routing metrics for the nearest known keys and 
consequently choose the node that can retrieve the data fastest. The 
response time heuristic assumed that nodes close in the key space 
have similar response times. This assumption stemmed from early 
deployment observations that Freenet peers seemed to specialize in 
parts of the keyspace -- it has not been justified analytically. 
Kronfol drew attention to Freenet’s inability to do keyword searches 
[283]. He suggested that peers cache lists of weighted keywords in 
order to route queries to documents, using Term Frequency Inverse 
Document Frequency (TFIDF) measures and inverted indexes (Section 
4.2.1). With these methods, a peer can route queries for simple 
keyword lists or more complicated conjunctions and disjunctions of 
keywords. Robustness analysis and simulation of Kronfol’s proposal 
remain open. 


The vast majority of P2P proposals in following sections rely ona 
distributed index. 
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3- 


Sas 


Semantic Free Index 


Many of today’s distributed network indexes are semantic. The 
semantic index is human-readable. For example, it might associate 
information with other keywords, a document, a database key, or even 
an administrative domain. It makes it easy to associate objects with 
particular network providers, companies, or organizations, as 
evidenced in the Domain Name System (DNS). However, it can also 
trigger legal tussles and frustrate content replication and migration 
[216]. 


Distributed Hash Tables (DHTs) have been proposed to provide 
semantic-free, data-centric references. DHTs enable one to find an 
object’s persistent key in a very large, changing set of hosts. They 
are typically designed for [23]: 


a) low degree. If each node keeps routing information for only a 
small number of other nodes, the impact of high node arrival and 
departure rates is contained; 


b) low hop count. The hops and delay introduced by the extra 
indirection are minimized; 


c) greedy routing. Nodes independently calculate a short path to the 
target. At each hop, the query moves closer to the target; and 


d) robustness. A path to the target can be found even when links or 
nodes fail. 


.1. Origins 


To understand the origins of recent DHTs, one needs to look to three 
contributions from the 1990s. The first two -- Plaxton, Rajaraman, 
and Richa (PRR) [30] and Consistent Hashing [49] -- were published 
within one month of each other. The third, the Scalable Distributed 
Data Structure (SDDS) [52], was curiously ignored in significant 
structured P2P designs despite having some similar goals [2, 6, 7]. 
It has been briefly referenced in other P2P papers [46, 284-287]. 


1.1. Plaxton, Rajaraman, and Richa (PRR) 


PRR is the most recent of the three. It influenced the designs of 
Pastry [2], Tapestry [6], and Chord [7]. The value of PRR is that it 
can locate objects using fixed-length routing tables [6]. Objects 
and nodes are assigned a semantic-free address, for example a 160-bit 
key. Every node is effectively the root of a spanning tree. A 
message routes toward an object by matching longer address suffixes, 
until it encounters either the object’s root node or another node 
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with a 'nearby” copy. It can route around link and node failure by 
matching nodes with a related suffix. The scheme has several 
disadvantages [6]: global knowledge is needed to construct the 
overlay; an object’s root node is a single point of failure; nodes 
cannot be inserted and deleted; and there is no mechanism for queries 
to avoid congestion hot spots. 


3.1.2. Consistent Hashing 


Consistent Hashing [288] strongly influenced the designs of Chord [7] 


and Koorde [37]. Karger, et al. introduced Consistent Hashing in the 
context of the Web-caching problem [49]. Web servers could 
conceivably use standard hashing to place objects across a network of 
caches. Clients could use the approach to find the objects. For 


normal hashing, most object references would be moved when caches are 
added or deleted. On the other hand, Consistent Hashing is "smooth" 
-- when caches are added or deleted, the minimum number of object 
references move so as to maintain load balancing. Consistent Hashing 
also ensures that the total number of caches responsible for a 
particular object is limited. Whereas Litwin’s Linear Hashing (LH*) 
scheme requires ‘buckets’ to be added one at a time in sequence [50], 
Consistent Hashing allows them to be added in any order [49]. There 
is an open Consistent Hashing problem pertaining to the fraction of 
items moved when a node is inserted [165]. Extended Consistent 
Hashing was recently proposed to randomize queries over the spread of 
caches to significantly reduce the load variance [289]. 
Interestingly, Karger [49] referred to an older DHT algorithm by 
Devine that used "a novel autonomous location discovery algorithm 
that learns the buckets’ locations instead of using a centralized 
directory" [51]. 


3.1.3. Scalable Distributed Data Structures (LH*) 


In turn, Devine’s primary point of reference was Litwin’s work on 
SDDSs and the associated LH* algorithm [52]. An SDDS satisfies three 
design requirements: files grow to new servers only when existing 
servers are well loaded; there is no centralized directory; and the 
basic operations like insert, search, and split never require atomic 
updates to multiple clients. Honicky and Miller suggested the first 
requirement could be considered a limitation since expansion to new 


servers is not under administrative control [286]. Litwin recently 
noted numerous similarities and differences between LH* and Chord 
[290]. He found that both implement key search. Although LH* refers 


to clients and servers, nodes can operate as peers in both. Chord 
‘splits’ nodes when a new node is inserted, while LH* schedules 
‘splits’ to avoid overload. Chord requests travel O(log n) hops, 
while LH* client requests need, at most, two hops to find the target. 
Chord stores a small number of 'fingers” at each node. LH* servers 
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store N/2 to N addresses while LH* clients store 1 to N addresses. 
This trade-off between hop count and the size of the index affects 
system robustness, and bears striking similarity to recent one- and 
two-hop P2P schemes in Section 2. The arrival and departure of LH* 
clients does not disrupt LH* server metadata at all. Given the size 
of the index, the arrival and departure of LH* servers are likely to 
cause more churn than that of Chord nodes. Unlike Chord, LH* has a 
single point of failure, the split coordinator. It can be 
replicated. Alternatively, it can be removed in later LH* variants, 
though details have not been progressed for lack of practical need 
[290]. 


3.2. Dependability 


We make four overall observations about their dependability. 
Dependability metrics fall into two categories: static dependability, 
a measure of performance before recovery mechanisms take over; and 
dynamic dependability, for the most likely case in massive networks 
where there is continual failure and recovery ("Churn"). 


3.2.1. Static Dependability 


Observation A: Static dependability comparisons show that no O(log n) 
DHT geometry is significantly more dependable than the other O(log n) 
geometries. 


Gummadi, et al. compared the tree, hypercube, butterfly, ring, XOR, 
and hybrid geometries. In such geometries, nodes generally know 
about O(log n) neighbors and route to a destination in O(log n) hops, 
where N is the number of nodes in the overlay. Gummadi, et al. asked 
"Why not the ring?" They concluded that only the ring and XOR 
geometries permit flexible choice of both neighbors and alternative 
routes [24]. Loguinov, et al. added the de Bruijn graph to their 
comparison [36]. They concluded that the classical analyses, for 
example the probability that a particular node becomes disconnected, 
yield no major differences between the resilience of Chord, CAN, and 
de Bruijn graphs. Using bisection width (the minimum edge count 
between two equal partitions) and path overlap (the likelihood that 
backup paths will encounter the same failed nodes or links as the 
primary path), they argued for the superior resilience of the de 
Bruijn graph. In short, ring, XOR, and de Bruijn graphs all permit 
flexible choice of alternative paths, but only in de Bruijn are the 
alternate paths independent of each other [36]. 
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3.2.2. Dynamic Dependability 


Observation B: Dynamic dependability comparisons show that DHT 
dependability is sensitive to the underlying topology maintenance 
algorithms. 


Li, et al. give the best comparison to date of several leading DHTs 
during churn [291]. They relate the disparate configuration 
parameters of Tapestry, Chord, Kademlia, Kelips, and OneHop to 
fundamental design choices. For each of these DHTs, they plotted the 
optimal performance in terms of lookup latency (milliseconds) and 
fraction of failed lookups. The results led to several important 
insights about the underlying algorithms, for example: increasing 
routing table size is more cost-effective than increasing the rate of 
periodic stabilization; learning about new nodes during the lookup 
process sometimes eliminates the need for stabilization; and parallel 
lookups reduce latency due to timeouts more effectively than faster 


stabilization. Similarly, Zhuang, et al. compared keep-alive 
algorithms for DHT failure detection [292]. Such algorithmic 
comparisons can significantly improve the dependability of DHT 
designs. 


In Figure 2, we propose a taxonomy for the topology maintenance 
algorithms that influence dependability. The algorithms can be 
classified by how nodes join and leave, how they first detect 
failures, how they share information about topology updates, and how 
they react when they receive information about topology updates. 
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Normal Updates 
Joins (passive; active) [293] 
Leaves (passive; active) [293] 


Fault Detection [292] 


Maintenance 

Proactive (periodic or keep-alive probes) 

Reactive (correction-on-use, correction-on-failure) [294] 
Report 


Negative (all dead nodes, nodes recently failed) 
Positive (all live nodes; nodes recently recovered) [292] 


Topology Sharing: yes/ no [292] 
Multicast Tree (explicit, implicit) [267, 295] 
Gossip (timeouts; number of contacts) [39] 


Corrective Action 
Routing 
Rerouting actions 
(reroute once; route in parallel [291]; reject) 
Routing timeouts 
(TCP-style, virtual coordinates) [296] 
Topology 
Update action (evict/ replace/ tag node) 
Update timeliness (immediate, periodic[296], delayed [297]) 


Figure 2: Topology Maintenance in Distributed Hash Tables 
3.2.3. Ephemeral or Stable Nodes -- O(log n) or O(1) Hops 


Observation C: Most DHTs use O(log n) geometries to suit ephemeral 
nodes. The O(1) hop DHTs suit stable nodes and deserve more research 
attention. 


Most of the DHTs in Section 3.5 assume that nodes are ephemeral, with 
expected lifetimes of one to two hours. Therefore, they mostly use 
an O(log n) geometry. The common assumption is that maintenance of 
full routing tables in the O(1) hop DHTs will consume excessive 
bandwidth when nodes are continually joining and leaving. The 
corollary is that, when they run on stable infrastructure servers 
[298], most of the DHTs in Section 3.5 are less than optimal -- 
lookups take many more hops than necessary, wasting latency and 
bandwidth budgets. The O(1) hop DHTs suit stable deployments and 
high lookup rates. For a churning 1024-node network, Li, et al. 
concluded that OneHop is superior to Chord, Tapestry, Kademlia, and 
Kelips in terms of latency and lookup success rate [291]. Fora 
3000-node network, they concluded that "OneHop is only preferable to 
Chord when the deployment scenario allows a communication cost 
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greater than 20 bytes per node per second" [291]. This apparent 
limitation needs to be put in context. They assumed that each node 
issues only one lookup every 10 minutes and has a lifetime of only 60 
minutes. It seems reasonable to expect that in some deployments, 
nodes will have a lifetime of weeks or more, a maintenance bandwidth 
of tens of kilobits per second, and a load of hundreds of lookups per 
second. O(1) hop DHTs are superior in such situations. OneHop can 
scale at least to many tens of thousands of nodes [267]. The recent 
O(1) hop designs [267, 295] are vastly outnumbered by the O(log n) 
DHTs in Section 3.5. Research on the algorithms of Figure 2 will 
also yield improvements in the dependability of the O(1) hop DHTs. 


3.2.4. Simulation and Proof 


Observation D: Although not yet a mature science, the study of DHT 
dependability is helped by recent simulation and formal development 
tools. 


While there are recent reference architectures [294, 298], much of 
the DHT literature in Section 3.5 does not lend itself to repeatable, 
comparative studies. The best comparative work to date [291] relies 
on the Peer-to-Peer Simulator (P2PSIM) [299]. At the time of 
writing, it supports more DHT geometries than any other simulator. 

As the study of DHTs matures, we can expect to see the simulation 
emphasis shift from geometric comparison to a comparison of the 
algorithms of Figure 2. 


P2P correctness proofs generally rely on less-than-complete formal 
specifications of system invariants and events [7, 45, 300]. Li and 
Plaxton expressed concern that "when many joins and leaves happen 
concurrently, it is not clear whether the neighbor tables will remain 
in a 'good” state" [47]. While acknowledging that guaranteeing 
consistency in a failure-prone network is impossible, Lynch, Malkhi, 
et al. sketched amendments to the Chord algorithm to guarantee 


atomicity [301]. More recently, Gilbert, Lynch, et al. gave a new 
algorithm for atomic read/write memory in a churning distributed 
network, suggesting it to be a good match for P2P [302]. Lynch and 


Stoica show in an enhancement to Chord that lookups are provably 
correct when there is a limited rate of joins and failures [303]. 
Fault Tolerant Active Rings is a protocol for active joins and leaves 
that was formally specified and proven using B-method tools [304]. A 
good starting point for a formal DHT development would be the 
numerous informal API specifications [22, 305, 306]. Such work could 
be informed by other efforts to formally specify routing invariants 
[307, 308]. 
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Latency 
The key metrics for DHT latency are: 


1) Shortest-Path Distance and Diameter. In graph theory, the 
shortest-path distance is the minimum number of edges in any path 
between two vertices of the graph. Diameter is the largest of all 
shortest-path distances in a graph [309]. Networking synonyms for 
distance on a DHT are "hop count" and "lookup length". 


2) Latency and Latency Stretch. Two types of latency are relevant 
here -- network-layer latency and overlay latency. Network-layer 
latency has been referred to as "proximity" or "locality" [24]. 
Stretch is the cost of an overlay path between two nodes, divided 
by the cost of the direct network path between those nodes [310]. 
Latency stretch is also known as the "relative delay penalty" 
ESLL]: 


1. Hop Count and the O(1)-Hop DHTs 


Hop count gives an approximate indication of path latency. O(1)-hop 
DHTs have path latencies lower than the O(log n)-hop DHTs [291]. 

This significant advantage is often overlooked on account of concern 
about the messaging costs to maintain large routing tables (Section 
3.2.3). Such concern is justified when the mean node lifetime is 
only a few hours and the mean lookup interval per node is more than a 
few seconds (the classic profile of a P2P file-sharing node). 
However, for a large, practical operating range (node lifetimes of 
days or more, lookup rates of over tens of lookups per second per 
node, up to ~100,000 nodes), the total messaging cost in O(1) hop 
DHTs is lower than in O(log n) DHTs [312]. Lookups and routing table 
maintenance contribute to the total messaging cost. If a deployment 
fits this operating range, then O(1)-hop DHTs will give lower path 
latencies and lower total messaging costs. An additional merit of 
the O(1)-hop DHTs is that they yield lower lookup failure rates than 
their O(log N)-hop counterparts [291]. 


Low hop count can be achieved in two ways: each node has a large O(N) 
index of nodes; or the object references can be replicated on many 
nodes. Beehive [313], Kelips [39], LAND [310], and Tulip [314] are 
examples of the latter category. Beehive achieves O(1) hops on 
average and O(log n) hops in the worst case, by proactive replication 
of popular objects. Kelips replicates the 'file index’. It incurs 
O(sqrt (N)) storage costs for both the node index and the file index. 
LAND uses O(log n) reference pointers for each stored object and an 
O(log n) index to achieve a worst-case l+e stretch, where O<e. The 
Kelips-like Tulip [314] requires 2 hops per lookup. Each node 


son & Moors Informational [Page 21] 


RFC 4981 Survey of Research on P2P Search September 2007 


maintains 2sqrt(N)log(N) links to other nodes and objects are 
replicated on O(sqrt(N)) nodes. 


The DHTs with a large O(N) node index can be divided into two groups: 
those for which the index is always O(N); and those for which the 
index opportunistically ranges from O(log n) to O(N). Linear Hashing 
(LH*) servers [52], OneHop [267], and 1h-Calot [295] fall into the 
former category. EpiChord [315] and Accordion [316] are examples of 
the latter. 


3.3.2. Proximity and the O(log n)-Hop DHTs 


If one chooses not to use single-hop DHTs, hop count is a weak 
indicator of end-to-end path latency. Some hops may incur large 
delays because of intercontinental or satellite links. Consequently, 
numerous DHT designs minimize path latency by considering the 
proximity of nodes. Gummadi, et al. classified the proximity methods 
as follows [24]: 


1) Proximity Neighbor Selection (PNS). The nodes in the routing 
table are chosen based on the latency of the direct hop to those 
nodes. The latency may be explicitly measured [317], or it may be 
estimated using one of several synthetic coordinate systems [150, 
154, 318]. As a lower bound on PNS performance, Dabek, et al. 
showed that lookups on O(log n) DHTs take at least 1.5 times the 
average roundtrip time of the underlying network [154]. 


2) Proximity Route Selection (PRS). At lookup time, the choice of 
the next-hop node relies on the latency of the direct hop to that 
node. PRS is less effective than PNS, though it may complement it 
[24]. Some of the routing geometries in Section 3.5 do not 
support PNS and/or PRS [24]. 


3) Proximity Identifier Selection (PIS). Node identifiers indicate 
geographic position. PIS frustrates load balancing, increases the 
risk of correlated failures, and is not often used [24]. 


The proximity study by Gummadi, et al. assumed recursive routing, 
though they suggested that PNS would also be superior to PRS with 
iterative routing [24]. Dabek, et al. found that recursive lookups 
take 0.6 times as long as iterative lookups [150]. 


Beyond the explicit use of proximity information, redundancy can help 
to avoid slow paths and servers. One may increase the number of 
replicas [150], use parallel lookups [291, 316], use alternate routes 
on failure [150], or use multiple gateway nodes to enter the DHT 
[317]. 
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3.4. Multicasting 
3.4.1. Multicasting vs. Broadcasting 


"Multicasting" here means sending a message to a subset of an 
overlay’s nodes. Nodes explicitly join and leave this subset, called 


a "multicast group". "Broadcasting" here is a special case of 
multicasting in which a message is sent to all nodes in the overlay. 
Broadcasting relies on overlay membership messages -- it does not 


need extra group membership messaging. Castro, et al. said 
multicasting on structured overlays is either "flooding" (one overlay 
per group) or "tree-based" (one tree per group) [319]. These are 
synonyms for broadcasting and multicasting respectively. 


The first DHT-based designs for multicasting were CAN multicast 
[320], Scribe [241], Bayeux [242], and i3 [231]. They were based on 
CAN [8], Pastry [2], Tapestry [31], and Chord [7] respectively. El- 
Ansary, et al. devised the first DHT-based broadcasting scheme [321]. 
It was based on Chord. 


Multicast trees can be constructed using reverse-path forwarding or 
forward-path forwarding. Scribe uses reverse-path forwarding [241]. 
Bayeux uses forward-path forwarding [242]. Borg, a multicast design 
based on Pastry, uses a combination of forward-path and reverse-path 
forwarding to minimize latency [237]. 


3.4.2. Motivation for DHT-based Multicasting 


Multicasting complements DHT search capability. DHTs naturally 
support exact match queries. With multicasting, they can support 
more complex queries. Multicasting also enables the dissemination 
and collection of global information. 


Consider, for example, aggregation queries like minimum, maximum, 
count, sum, and average (Section 5.4). A node at the root of a 
dissemination tree might multicast such a query [322]. The leaf 
nodes return local results towards the root node. Successive parents 
aggregate the result so that eventually the root node can compute the 
global result. Such queries may help to monitor the capacity and 
health of the overlay itself. 


Why bother with structured overlays for multicasting? In Section 
2.1, we saw that Gnutella can multicast complex queries without them 
[4]. Castro, et al. posed the question, "Should we build Gnutella on 
a structured overlay?" [259]. While acknowledging that their study 
was preliminary, they did conclude that "we see no reason to build 
Gnutella on top of an unstructured overlay" [259]. The supposedly 
high maintenance costs of structured overlays were outweighed by 
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query cost savings. The structured overlay ensured that nodes were 
only visited once during a complex query. It also helped to 
accurately limit the total number of nodes visited. Pai, et al. 
acknowledged that multicast trees based on structured overlays 
contribute to simple routing rules, low delay and low delay variation 
[323]. However, they opted for unstructured, gossip-based 
multicasting for reliability reasons: data loss near the tree root 
affects all subtended nodes; interior node failures must be repaired 
quickly; interior nodes are obliged to disseminate more than their 


fair share of traffic, giving leaf nodes a "free ride". The most 
promising research direction is to improve on the Bimodal 
Multicasting approach [324]. It combines the bandwidth efficiency 


and low latency of structured, best-effort multicasting trees with 
the reliability of unstructured gossip protocols. 


3.4.3. Design Issues 


None of the early structured overlay multicast designs addressed all 
of the following issues [325]: 


1) Heterogeneous Node Capacity. Nodes differ in their processing, 
memory, and network capacity. Multicast throughput is largely 
determined by the node with smallest throughput [325]. To limit 
the multicasting load on a node, one might cap its out-degree. If 
the same node receives further join requests, it refers them to 
its children ("pushdown") [240]. Bharambe, et al. explored 
several pushdown strategies but found them inadequate to deal with 
heterogeneity [326]. They concluded that the heterogeneity issue 
remains open, and should be addressed before deploying DHTs for 
high-bandwidth multicasting applications. Independently, Zhang et 
al. partially tackled heterogeneity by allowing nodes in their 
CAM-Chord and CAM-Koorde designs to vary out-degree according to 
the node’s capacity [325]. However, they made no mention of the 
"pushdown" issue -- they did not describe topology maintenance 
when the out-degree limit is reached. 


2) Reliability (Dynamic Membership). If a multicast tree is to be 
resilient, it must survive dynamic membership. There are several 
ways to deal with dynamic membership: ensure that the root node of 
the multicasting tree does not handle all requests to join or 
leave the multicast group [242]; use multiple interior-node- 
disjoint trees to avoid single points of failure in tree 
structures [322]; and split the root node into several replicas 
and partition members across them [241]. For example, Bayeux 
requires the root node to track all group membership changes 
whereas Scribe does not [241]. CAN-multicast uses a single, 
well-known host to bootstrap the join operations [320]. The 
earliest DHT-based broadcasting work by El-Ansary, et al. did not 
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address the issue of dynamic membership [321]. Ghodsi, et al. 
addressed it in a subsequent paper, though, giving two broadcast 
algorithms that accommodate routing table inconsistencies [327]. 
One algorithm achieves a more optimal multicasting network at the 
expense of greater correction overhead. Splitstream, based on 
Scribe and Pastry, redundantly striped content across multiple 
interior-node-disjoint multicast trees -- if one interior node 
fails, then only one stripe is lost [240]. 


3) Large Any-Source Multicast Groups. Any group member should be 
allowed to send multicast messages. The group should scale toa 
very large number of hosts. CAN-based multicast was the first 
application-level multicast scheme to scale to groups of several 
thousands of nodes without restricting the service model to a 
single source [320]. Bayeux scales to large groups but has a 
single root node for each multicast group. It supports the any- 
source model only by having the root node operate as a reflector 
for multiple senders [242]. 


3.5. Routing Geometries 


In Sections 3.5.1 to 3.5.6, we introduce the main geometries for 
simple key lookup and survey their robustness mechanisms. 


3.5.1. Plaxton Trees (Pastry, Tapestry) 


Work began in March 2000 on a structured, fault-tolerant, wide-area 
Dynamic Object Location and Routing (DOLR) system called Tapestry [6, 
155]. While DHTs fix replica locations, a DOLR API enables 
applications to control object placement [31]. Tapestry’s basic 
location and routing scheme follows Plaxton, Rajaraman, and Richa 
(PRR) [30], but it remedies PRR’s robustness shortcomings described 
in Section 3.1. Whereas each object has one root node in PRR, 
Tapestry uses several to avoid a single point of failure. Unlike 
PRR, it allows nodes to be inserted and deleted. Whereas PRR 
required a total ordering of nodes, Tapestry uses 'surrogate routing’ 
to incrementally choose root nodes. The PRR algorithm does not 
address congestion, but Tapestry can put object copies close to nodes 
generating high query loads. PRR nodes only know of the nearest 
replica, whereas Tapestry nodes enable selection from a set of 
replicas (for example, to retrieve the most up to date). To detect 
routing faults, Tapestry uses TCP timeouts and UDP heartbeats for 
detection, sequential secondary neighbours for rerouting, anda 
"second chance’ window so that recovery can occur without the 
overhead of a full node insertion. Tapestry’s dependability has been 
measured on a testbed of about 100 machines and on simulations of 
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about 1000 nodes. Successful routing rates and maintenance 
bandwidths were measured during instantaneous failures and ongoing 
churn [31]. 


Pastry, like Tapestry, uses Plaxton-like prefix routing [2]. As in 
Tapestry, Pastry nodes maintain O(log n) neighbours and route to a 
target in O(log n) hops. Pastry differs from Tapestry only in the 
method by which it handles network locality and replication [2]. 
Each Pastry node maintains a 'leaf set’ and a 'routing table’. The 
leaf set contains 1/2 node IDs on either side of the local node ID in 
the node ID space. The routing table, in row r, column c, points to 
the node ID with the same r-digit prefix as the local node, but with 
an r+1 digit of c. A Pastry node periodically probes leaf set and 
routing table nodes, with periodicity of Tls and Trt and a timeout 
Tout. Mahajan, Castry, et al. analyzed the reliability versus 
maintenance cost trade-offs in terms of the parameters 1, Tls, Trt, 
and Tout [328]. They concluded that earlier concerns about excessive 
maintenance cost in a churning P2P network were unfounded, but 
suggested follow-up work for a wider range of reliability targets, 
maintenance costs, and probe periods. Rhea Geels, et al. concluded 
that existing DHTs fail at high churn rates [329]. Building on a 
Pastry implementation from Rice University, they found that most 
lookups fail to complete when there is excessive churn. They 
conjectured that short-lived nodes often leave the network with 
lookups that have not yet timed out, but no evidence was provided to 
confirm the theory. They identified three design issues that affect 
DHT performance under churn: reactive versus periodic recovery of 
peers; lookup timeouts; and choice of nearby neighbours. Since 
reactive recovery was found to add traffic to already congested 
links, the authors used periodic recovery in their design. For 
lookup timeouts, they advocated an exponentially weighted moving 
average of each neighbour’s response time, over alternative fixed 
timeout or ’virtual coordinate’ schemes. For selection of nearby 
neighbours, they found that “global sampling’ was more effective than 
simply sampling a /neighbour’s neighbours’ or 'inverse neighbours’. 
Castro, Costa, et al. have refuted the suggestion that DHTs cannot 
cope with high churn rates [330]. By implementing methods for 
continuous detection and repair, their MSPastry implementation 
achieved shorter routing paths and a maintenance overhead of less 
than half a message per second per node. 


There have been more recent proposals based on these early Plaxton- 
like schemes. Kademlia uses a bit-wise exclusive or (XOR) metric for 
the ‘distance’ between 160-bit node identifiers [45]. Each node 
keeps a list of contact nodes for each section of the node space that 
is between 2%i and 2^(i+1) from itself (0.i<160). Longer-lived nodes 
are deliberately given preference on this list -- it has been found 
in Gnutella that the longer a node has been active, the more likely 


Risson & Moors Informational [Page 26] 


RFC 4981 Survey of Research on P2P Search September 2007 


it is to remain active. Like Kademlia, Willow uses the XOR metric 
[32]. It implements a Tree Maintenance Protocol to ‘zipper’ together 
broken segments of a tree. Where other schemes use DHT routing to 
inefficiently add new peers, Willow can merge disjoint or broken 
trees in O(log n) parallel operations. 


3.5.2. Rings (Chord, DKS) 


Chord is the prototypical DHT ring, so we first sketch its operation. 
Chord maps nodes and keys to an identifier ring [7, 34]. Chord 
supports one main operation: find a node with the given key. It uses 
Consistent Hashing (Section 3.1) to minimize disruption of keys when 
nodes join and leave the network. However, Chord peers need only 
track O(log n) other peers, not all peers as in the original 


consistent hashing proposal [49]. It enables concurrent node 
insertions and deletions, improving on PRR. Compared to Pastry, it 
has a simpler join protocol. Each Chord peer tracks its predecessor, 


a list of successors, and a finger table. Using the finger table, 
each hop is at least half the remaining distance around the ring to 
the target node, giving an average lookup hop count of (1/2)log 
n(base 2). Each Chord node runs a periodic stabilization routine 
that updates predecessor and successor pointers to cater to newly 
added nodes. All successors of a given node need to fail for the 
ring to fail. Although a node departure could be treated the same as 
a failure, a departing Chord node first notifies the predecessor and 
successors, so as to improve performance. 


In their definitive paper, Chord’s inventors critiqued its 
dependability under churn [34]. They provided proofs on the 
behaviour of the Chord network when nodes in a stable network fail, 
stressing that such proofs are inadequate in the general case of a 
perpetually churning network. An earlier paper had posed the 
question, "For lookups to be successful during churn, how regularly 


do the Chord stabilization routines need to run?" [331]. Stoica, 
Morris, et al. modeled a range of node join/departure rates and 
stabilization periods for a Chord network of 1000 nodes. They 


measured the number of timeouts (caused by a finger pointing to a 
departed node) and lookup failures (caused by nodes that temporarily 
point to the wrong successor during churn). They also modeled the 
‘lookup stretch’, the ratio of the Chord lookup time to optimal 
lookup time on the underlying network. They demonstrated the latency 
advantage of recursive lookups over iterative lookups, but there 
remains room for delay reduction. For further work, the authors 
proposed to improve resilience to network partitions, using a small 
set of known nodes or ’remembered’ random nodes. To reduce the 
number of messages per lookup, they suggested an increase in the size 
of each step around the ring, accomplished via a larger number of 
fingers at each node. Much of the paper assumed independent, equally 
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likely node failures. Analysis of correlated node failures, caused 
by massive site or backbone failures, will be more important in some 
deployments. The paper did not attempt to recommend a fixed optimal 
stabilization rate. Liben-Nowell, Balakrishnan, et al. had suggested 
that optimum stabilization rate might evolve according to 
measurements of peers’ behaviour [331] -- such a mechanism has yet to 
be devised. 


Alima, El-Ansary, et al. considered the communication costs of 
Chord’s stabilization routines, referred to as ’active correction’, 
to be excessive [332]. Two other robustness issues also motivated 
their Distributed K-ary Search (DKS) design, which is similar to 
Chord. Firstly, the total system should evolve for an optimum 
balance between the number of peers, the lookup hop count, and the 
size of the routing table. Secondly, lookups should be reliable -- 
P2P algorithms should be able to guarantee a successful lookup for 
key/value pairs that have been inserted into the system. A similar 
lookup-correctness issue was raised elsewhere by one of Chord’s 
authors; "Is it possible to augment the data structure to work even 
when nodes (and their associated finger lists) just disappear?" [333] 
Alima, El-Ansary, et al. asserted that P2Ps using active correction, 
like Chord, Pastry, and Tapestry, are unable to give such a 
guarantee. They propose an alternate ’correction-on-use’ scheme, 
whereby expired routing entries are corrected by information 
piggybacking lookups and insertions. A prerequisite is that lookup 
and insertion rates are significantly higher than node arrival, 
departure, and failure rates. Correct lookups are guaranteed in the 
presence of simultaneous node arrivals or up to f concurrent node 
departures, where f is configurable. 


3.5.3. Tori (CAN) 


Ratnasamy, Francis, et al. developed the Content-Addressable Network 
(CAN), another early DHT widely referenced alongside Tapestry, 


Pastry, and Chord [8, 334]. It is arranged as a virtual 
d-dimensional Cartesian coordinate space on a d-torus. Each node is 
responsible for a zone in this coordinate space. The designers used 


a heuristic thought to be important for large, churning P2P networks: 
keep the number of neighbours independent of system size. 
Consequently, its design differs significantly from Pastry, Tapestry, 
and Chord. Whereas they have O(log n) neighbours per node and O(log 
n) hops per lookup, CAN has O(d) neighbours and O(dn” (1/d)) hop 
count. When CAN’s system-wide parameter d is set to log(n), CAN 
converges to their profile. If the number of nodes grows, a major 
rearrangement of the CAN network may be required [151]. The CAN 
designers considered building on PRR, but opted for the simple, low- 
state-per-node CAN algorithm instead. They had reasoned that a PRR- 
based design would not perform well under churn, given node 
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departures and arrivals would affect a logarithmic number of nodes 
[8]. 


There have been preliminary assessments of CAN’s resilience. When a 
node leaves the CAN in an orderly fashion, it passes its own Virtual 
ID (VID), its neighbours’ VIDs and IP addresses, and its key/value 
pairs to a takeover node. If a node leaves abruptly, its neighbours 
send recovery messages towards the designated takeover node. CAN 
ensures the recovery messages reach the takeover node, even if nodes 
die simultaneously, by maintaining a VID chain with Chord’s 
stabilization algorithm. Some initial 'proof of concept’ resilience 
simulations were run using the Network Simulator (NS) [335] for up to 
a few hundred nodes. Average hop counts and lookup failure 
probabilities were plotted against the total number of nodes for 
various node failure rates [8]. The CAN team documented several open 
research questions pertaining to state/hop count trade-offs, 
resilience, load, locality, and heterogeneous peers [44, 334]. 


3.5.4. Butterflies (Viceroy) 


Viceroy approximates a butterfly network [46]. It generally has 
constant degree like CAN. Like Chord, Tapestry, and Pastry, it has 
logarithmic diameter. It improves on these systems, inasmuch as its 
diameter is better than CAN and its degree is better than Chord, 
Tapestry, and Pastry. As with most DHTs, it utilizes Consistent 
Hashing. When a peer joins the Viceroy network, it takes a random 
but permanent 'identity” and selects its 'level” within the network. 
Each peer maintains general ring pointers (’predecessor’ and 
"successor’), level ring pointers (’nextonlevel’ and 'prevonlevel'), 
and butterfly pointers (’left’, 'right', and 'up'). When a peer 
departs, it normally passes its key pairs to a successor, and 
notifies other peers to find a replacement peer. 


The Viceroy paper scoped out the issue of robustness. It explicitly 
assumed that peers do not fail [46]. It assumed that join and leave 
operations do not overlap, so as to avoid the complication of 
concurrency mechanisms like locking. Kaashoek and Karger were 
somewhat critical of Viceroy’s complexity [37]. They also pointed to 
its fault-tolerance blind spot. Li and Plaxton suggested that such 
constant-degree algorithms deserve further consideration [47]. They 
offered several pros and cons. The limited degree may increase the 
risk of a network partition, or inhibit use of local neighbours (for 
the simple reason that there are less of them). On the other hand, 
it may be easier to reason about the correctness of fixed-degree 
networks. One of the Viceroy authors has since proposed constant- 
degree peers in a two-tier, locality-aware DHT [310] -- the lower 
degree maintained by each lower-tier peer purportedly improves 
network adaptability. Another Viceroy author has since explored an 
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alternative bounded-degree graph for P2P, namely the de Bruijn graph 
[336]. 


3.5.5. de Bruijn (D2B, Koorde, Distance Halving, ODRI) 


De Bruijn graphs have had numerous refinements since their inception 
[337, 338]. Schlumberger was the first to use them for networking 
[339]. Two research teams independently devised the ’generalized’ de 
Bruijn graph that accommodates a flexible number of nodes in the 
system [340, 341]. Rowley and Bose studied fault-tolerant rings 
overlaid on the de Bruijn graph [342]. Lee, Liu, et al. devised a 
two-level de Bruijn hierarchy, whereby clusters of local nodes are 
interconnected by a second-tier ring [343]. 


Many of the algorithms discussed previously are 'greedy” in that each 
time a query is forwarded, it moves closer to the destination. 
Unfortunately, greedy algorithms are generally suboptimal -- for a 
given degree, the routing distance is longer than necessary [344]. 
Unlike these earlier P2P designs, de Bruijn graphs of degree k 
achieve an asymptotically optimal diameter log n, where n is the 
number of nodes in the system and k can be varied to improve 
resilience. If there are O(log n) neighbours per node, the de Bruijn 
hop count is O(log n/log log n). To illustrate de Bruijn’s practical 
advantage, consider a network with one million nodes of degree 20: 
Chord has a diameter of 20, while de Bruijn has a diameter of 5 [36]. 


In 2003, there were a quick succession of de Bruijn proposals -- D2B 
[345], Koorde [37], Distance Halving [132, 336], and the Optimal 
Diameter Routing Infrastructure (ODRI) [36]. 


Fraigniaud and Gauron began the D2B design by laying out an informal 
problem statement: keys should be evenly distributed; lookup latency 
should be small; traffic load should be evenly distributed; updates 
of routing tables and redistribution of keys should be fast when 
nodes join or leave the network. They defined a node’s "congestion" 
to be the probability that a lookup will traverse it. Apart from its 
optimal de Bruijn diameter, they highlighted D2B’s merits: a constant 
expected update time when nodes join and leave (O(log n) with high 
probability (w.h.p.)); the expected node congestion is O((log n)/n) 
(O(((log n)?2)/n) w.h.p.) [345]. D2B’s resilience was discussed only 
in passing. 


Koorde extends Chord to attain the optimal de Bruijn degree/diameter 
trade-off above [37]. Unlike D2B, Koorde does not constrain the 
selection of node identifiers. Also unlike D2B, it caters to 
concurrent joins, by extension of Chord’s functionality. Kaashoek 
and Karger investigated Koorde’s resilience to a rather harsh failure 
scenario: "in order for a network to stay connected when all nodes 
fail with probability of 1/2, some nodes must have degree 
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omega(log n)" [37]. They sketched a mechanism to increase Koorde’s 
degree for this more stringent fault tolerance, losing de Bruijn’s 
constant degree advantage. Similarly, to achieve a constant-factor 
load balance, Koorde would have to sacrifice its degree optimality. 
They suggested that the ability to trade the degree, and hence the 
maintenance overhead, against the expected hop count may be important 
for churning systems. They also identified an open problem: find a 
load-balanced, degree optimal DHT. Datta, Girdzijauskas, et al. 
showed that for arbitrary key distributions, de Bruijn graphs fail to 
meet the dual goals of load balancing and search efficiency [346]. 
They posed the question, "(Is there) a constant routing table sized 
DHT which meets the conflicting goals of storage load balancing and 
search efficiency for an arbitrary and changing key distribution?" 


Distance Halving was also inspired by de Bruijn [336] and shares its 
optimal diameter. Naor and Wieder argued for a two-step 
"continuous-discrete" approach for its design. The correctness of 
its algorithms is proven in a continuous setting. The algorithms are 
then mapped to a discrete space. The source x and target y are 
points on the continuous interval [0,1). Data items are hashed to 
this same interval. <str> is a string that determines how messages 
leave any point on the ring: if bit t of the string is 0, the left 
leg is taken; if it is 1, the right leg is taken. <str> increases by 
one bit each hop, giving a sequence by which to step around the ring. 
A lookup has two phases. In the first, the lookup message containing 
the source, target, and the random string hops toward the midpoint of 
the source and target. On each hop, the distance between <str> (x) 
and <str>(y) is halved, by virtue of the specific ’left’ and 'right” 
functions. In the second phase, the message steps 'backward” from 
the midpoint to the target, removing the last bit in <str> at each 
hop. ‘Join’ and ’leave’ algorithms were outlined but there was no 
consideration of recovery times or message load on churn. Using the 
Distance Halving properties, the authors devised a caching scheme to 
relieve congestion in a large P2P network. They have also modified 
the algorithm to be more robust in the presence of random faults 
[1321% 


Solid comparisons of DHT resilience are scarce, but Loguinov, Kumar, 
et al. give just that in their ODRI paper [36]. They compare Chord, 
CAN, and de Bruijn in terms of routing performance, graph expansion 
and clustering. At the outset, they give the optimal diameter (the 
maximum hop count between any two nodes in the graph) and average hop 
count for graphs of fixed degree. De Bruijn graphs converge to both 
optima, and outperform Chord and CAN on both counts. These optima 
impact both delay and aggregate lookup load. They present two 
clustering measures (edge expansion and node expansion), which are 
interesting for resilience. Unfortunately, after decades of de 
Bruijn research, they have no exact solution. De Bruijn was shown to 
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be superior in terms of path overlap - "de Bruijn automatically 
selects backup paths that do not overlap with the best shortest path 
or with each other" [36]. 


3.5.6. Skip Graphs 


Skip Graphs have been pursued by two research camps [38, 41]. They 
augment the earlier Skip Lists [347, 348]. Unlike earlier balanced 
trees, the Skip List is probabilistic -- its insert and delete 
operations do not require tree rearrangements and so are faster by a 
constant factor. The Skip List consists of layers of ordered linked 
lists. All nodes participate in the bottom layer 0 list. Some of 
these nodes participate in the layer 1 list with some fixed 
probability. A subset of layer 1 nodes participate in the layer 2 
list, and so on. A lookup can proceed quickly through the list by 
traversing the sparse upper layers until it is close to, or at, the 
target. Unfortunately, nodes in the upper layers of a Skip List are 
potential hot spots and single points of failure. Unlike Skip Lists, 
Skip Graphs provide multiple lists at each level for redundancy, and 
every node participates in one of the lists at each level. 


Each node in a Skip Graph has theta(log n) neighbours on average, 
like some of the preceding DHTs. The Skip Graph’s primary edge over 
the DHTs is its support for prefix and proximity search. DHTs hash 
objects to a random point in the graph. Consequently, they give no 
guarantees over where the data is stored. Nor do they guarantee that 
the path to the data will stay within the one administration as far 
as possible [38]. Skip graphs, on the other hand, provide for 
location-sensitive name searches. For example, to find the document 
docname on the node user.company.com, the Skip Graph might step 
through its ordered lists for the prefix com.company.user [38]. 
Alternatively, to find an object with a numeric identifier, an 
algorithm might search the lowest layer of the Skip Graph for the 
first digit, the next layer for the next digit, in the same vein 
until all digits are resolved. Being ordered, Skip Graphs also 
facilitate range searches. In each of these examples, the Skip Graph 
can be arranged such that the path to the target, as far as possible, 
stays within an administrative boundary. If one administration is 
detached from the rest of the Skip Graph, routing can continue within 
each of the partitions. Mechanisms have been devised to merge 
disconnected segments [157], though at this stage, segments are re- 
merged one at a time. A parallel merge algorithm has been flagged 
for future work. 


The advantages of Skip Graphs come at a cost. To be able to provide 
range queries and data placement flexibility, Skip Graph nodes 
require many more pointers than their DHT counterparts. An increased 
number of pointers implies increased maintenance traffic. Another 
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shortcoming of at least one of the early proposals was that no 


algorithm was given to assign keys to machines. Consequently, there 
are no guarantees on system-wide load balancing or on the distance 
between adjacent keys [100]. Aspnes, Kirsch, et al. have recently 


devised a scheme to reduce the inter-machine pointer count from 
O(mlogm), where m is the number of data elements, to O(nlog n), where 
n is the number of nodes [100]. They proposed a two-layer scheme -- 
one layer for the Skip Graph itself and the second ’bucket layer’. 
Each machine is responsible for a number of buckets and each bucket 
elects a representative key. Nodes locally adjust their load. They 
accept additional keys if they are below their threshold or disperse 
keys to nearby nodes if they are above threshold. There appear to be 
numerous open issues: simulations have been done but analysis is 
outstanding; mechanisms are required to handle the arrival and 
departure of nodes; there were only brief hints as to how to handle 
nodes with different capacities. 


4. Semantic Index 


Semantic indexes capture object relationships. While the semantic- 
free methods (DHTs) have firmer theoretic foundations and guarantee 
that a key can be found if it exists, they do not capture the 
relationships between the document name and its content or metadata 
on their own. Semantic P2P designs do. However, since their design 
is often driven by heuristics, they may not guarantee that scarce 
items will be found. 


So what might the semantically indexed P2Ps add to an already crowded 
field of distributed information architectures? At one extreme, 
there are the distributed relational database management systems 
(RDBMSs), with their strong consistency guarantees [284]. They 
provide strong data independence, the flexibility of SQL queries, and 
strong transactional semantics -- Atomicity, Consistency, Isolation 
and Durability (ACID) [349]. They guarantee that the query response 
is complete -- all matching results are returned. The price is 
performance. They scale to perhaps 1000 nodes, as evidenced in 
Mariposa [350, 351], or require query caching front ends to constrain 
the load [284]. Database research has "arguably been cornered into 
traditional, high-end, transactional applications" [72]. Then there 
are distributed file systems, like the Network File System (NFS) or 
the Serverless Network File Systems (xFS), with little data 
independence, low-level file retrieval interfaces, and varied 
consistency [284]. Today's eclectic mix of Content Distribution 
Networks (CDNs) generally deload primary servers by redirecting Web 
requests to a nearby replica. Some intercept the HTTP requests at 
the DNS level and then use consistent hashing to find a replica [23]. 
Since this same consistent hashing was a forerunner to the DHT 
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approaches above, CDNs are generally constrained to the same simple 
key lookups. 


The opportunity for semantically indexed P2Ps, then, is to provide: 


a) graduated data independence, consistency, and query flexibility, 
and 


b) probabilistically complete query responses, across 


c) very large numbers of low-cost, geographically distributed, 
dynamic nodes. 


4.1. Keyword Lookup 


P2P keyword lookup is best understood by considering the structure of 
the underlying index and the algorithms by which queries are routed 
over that index. Figure 3 summarizes the following paragraphs by 
classifying the keyword query algorithms, index structures, and 
metrics. The research has largely focused on scalability, not 
dependability. There have been very few studies that quantify the 
impact of network churn. One exception is the work by Chawathe, et 
al. on the Gia system [61]. Gia’s combination of algorithms from 
Figure 3 (receiver-based flow control, biased random walk, and one- 
hop replication) gave 2-4 orders of magnitude improvement in query 
success rates in churning networks. 
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QUERY 
Query routing 

Flooding: Peers only index local files so queries must propagate 
widely [4] 

Policy-based: Choice of the next hop node: random; most/least 
recently used; most files shared; most results [265, 352] 

Random walks: Parallel [67] or biased random walks [61, 66] 

Query forwarding 

Iterative: Nodes perform iterative unicast searches of ultrapeers, 
until the desired number of results is achieved. See Gnutella 
UDP Extension for Scalable Searches (GUESS) [265, 353] 

Recursive 

Query flow control 

Receiver-controlled: Receivers grant query tokens to senders, so 
as to avoid overload [61] 

Reactive: sender throttles queries when it notices receivers are 
discarding packets [61, 66] 

Dynamic Time To Live: In the Dynamic Query Protocol, the sender 
adjusts the time-to-live on each iteration based on the number 
of results received, the number of connections left, and the 
number of nodes already theoretically reached by the search [354] 


INDEX 
Distribution 

Compression: Leaf nodes periodically send ultrapeers compressed 
query routing tables, as in the Query Routing Protocol [260] 

One hop replication: Nodes maintain an index of content on their 
nearest neighbors [61, 352] 

Partitioning 

By document [210] 

By keyword: Use an inverted list to find a matching document, 
either locally or at another peer [21]. Partition by keyword 
sets [355] 

By document and keyword: Also called Multi-Level Partitioning [21] 


METRIC 
Query load: Queries per second per node/link [65, 265] 
Degree: The number of links per node [66, 352]. Early P2P networks 


approximated power-law networks, where the number of nodes with L 
links is proportional to L”(-k), where k is a constant [65] 

Query delay: Reported in terms of time and hop count [61, 66] 

Query success rate: The "Collapse Point" is the per-node query rate 
at which the query success rate drops below 90% [61]. See 
also [61, 265, 352]. 


Figure 3: Keyword Lookup in P2P Systems 
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4.1. 


Ris 


1. Gnutella Enhancements 


Perhaps the most widely referenced P2P system for simple keyword 
match is Gnutella [4]. Gnutella queries contain a string of 
keywords. Gnutella peers answer when they have files whose names 
contain all the keywords. As discussed in Section 2.1, early 
versions of Gnutella did not forward the document index. Queries 
were flooded and peers searched their own local indexes for filename 
matches. An early review highlighted numerous areas for improvement 


[65]. It was estimated that the query traffic alone from 50,000 
early-generation Gnutella nodes would amount to 1.7% of the total 
U.S. Internet backbone traffic at December 2000 levels. It was 


speculated that high-degree Gnutella nodes would impede 
dependability. An unnecessarily high percentage of Gnutella traffic 
crossed Autonomous System (AS) boundaries -- a locality mechanism may 
have found suitable nearby peers. 


Fortunately, there have since been numerous enhancements within the 
Gnutella Developer Forum. At the time of writing, it has been 
reported that Gnutella has almost 350,000 unique hosts, of which 
nearly 90,000 accept incoming connections [356]. One of the main 
improvements is that an index of filename keywords, called the Query 
Routing Table (QRT), can now be forwarded from 'leaf peers’ to its 
‘ultrapeers’ [260]. Ultrapeers can then ensure that the leaves only 
receive queries for which they have a match, dramatically reducing 
the query traffic at the leaves. Ultrapeers can have connections to 
many leaf nodes (710-100) and a small number of other ultrapeers 
(<10) [260]. Originally, a leaf node’s ORT was not forwarded by the 
parent ultrapeer to other ultrapeers. More recently, there has been 
a proposal to distribute aggregated ORTs amongst ultrapeers [357]. 
To further limit traffic, ORTs are compressed by hashing, according 
to the Query Routing Protocol (ORP) specification [281]. This same 
specification claims QRP may reduce Gnutella traffic by orders of 
magnitude, but cautions that simulation is required before mass 
deployment. A known shortcoming of QRP was that the extent of query 
propagation was independent of the popularity of the search terms. 
The Dynamic Query Protocol addressed this [358]. It required leaf 
nodes to send single queries to high-degree ultrapeers that adjust 
the queries” time-to-live (TTL) bounds according to the number of 
received query results. An earlier proposal, called the Gnutella UDP 
Extension for Scalable Searches (GUESS) [353], similarly aimed to 
reduce the number of queries for widely distributed files. GUESS 
reuses the non-forwarding idea (Section 2). A GUESS peer repeatedly 
queries single ultrapeers with a TTL of 1, with a small timeout on 
each query to limit load. It chooses the number of iterations and 
selects ultrapeers so as to satisfy its search needs. For 
adaptability, a small number of experimental Gnutella nodes have 
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implemented eXtensible Markup Language (XML) schemas for richer 
queries [359, 360]. None of the above Gnutella proposals explicitly 
assess robustness. 


The broader research community has recently been leveraging aspects 
of the Gnutella design. Lv, Ratnasamy, et al. exposed one assumption 
implicit in some of the early DHT work -- that designs "such as 
Gnutella are inherently not scalable, and therefore should be 
abandoned" [66]. They argued that by making better use of the more 
powerful peers, Gnutella’s scalability issues could be alleviated. 
Instead of its flooding mechanism, they used random walks. Their 
preliminary design to bias random walks towards high capacity nodes 
did not go as far as the ultrapeer proposals in that the indexes did 
not move to the high-capacity nodes. Chawathe, Ratnasamy, et al. 
chose to extend the Gnutella design with their Gia system, in 
response to the perceived shortcomings of DHTs in Section 1.2 [61]. 
Compared to the early Gnutella designs, they incorporated several 
novel features. They devise a topology adaptation algorithm so that 
most peers are attached to high-degree peers. They use a random walk 
search algorithm, in lieu of flooding, and bias the query load 
towards higher-degree peers. For 'one-hop replication’, they require 
all nodes to keep pointers to content on adjacent peers. To 
implement a receiver-controlled token-based flow control, a peer must 
have a token from its neighbouring peer before it sends a query to 
it. Chawathe, Ratnasamy, et al. show by simulations that the 
combination of these features provides a scalability improvement of 
three to five orders of magnitude over Gnutella "while retaining 
significant robustness". The main robustness metrics they used were 
the 'collapse point’ query rate (the per-node query rate at which the 
successful query rate falls below 90%) and the average hop count 
immediately prior to collapse. Their comparison with Gnutella did 
not take into account the Gnutella enhancements above -- this was 
left as future work. Castro, Costa, and Rowstron argued that if 
Gnutella were built on top of a structured overlay, then both the 
query and overlay maintenance traffic could be reduced [259]. Yang, 
Vinograd, et al. explore various policies for peer selection in the 
GUESS protocol, since the issue is left open in the original proposal 


[265]. For example, the peer initiating the query could choose peers 
that have been "most recently used" or that have the "most files 
shared". Various policy pitfalls are identified. For example, good 


peers could be overloaded, victims of their own success. 
Alternatively, malicious peers could encourage the querying peer to 
try inactive peers. They conclude that a "most results" policy gives 
the best balance of robustness and efficiency. Like Castro, Costa, 
and Rowstron, they concentrated on the static network scenario. 
Cholvi, Felber, et al. very briefly describe how similar "least 
recently used" and "most often used" heuristics can be used by a peer 
to select peer 'acquaintances” [352]. They were motivated by the 
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congestion associated with Gnutella’s TTL-limited flooding. 
Recognizing that the busiest peers can quickly become overloaded 
central hubs for the entire network, they limit the number of 
acquaintances for any given peer to 25. They sketch a mechanism to 
decrement a query’s TTL multiple times when it traverses "interested 
peers". In summary, these Gnutella-related investigations are 
characterized by a bias for high-degree peers and very short directed 
query paths, a disdain for flooding, and concern about excessive load 
on the ‘better’ peers. Generally, the robustness analysis for 
dynamic networks (content updates and node arrivals/departures) 
remains open. 


4.1.2. Partition-by-Document, Partition-by-Keyword 


One aspect of P2P keyword search systems has received particular 
attention: should the index be partitioned by document or by keyword? 
The issue affects scalability. To be partitioned by document, each 
node has a local index of documents for which it is responsible. 
Gnutella is a prime example. Queries are generally flooded in 
systems partitioned by document. On the other hand, a peer may 
assume responsibility for a set of keywords. The peer uses an 
inverted list to find a matching document, either locally or at 
another peer. If the query contains several keywords, inverted lists 
may need to be retrieved from several different peers to find the 
intersection [21]. The initial assessment by Li, Loo, et al. was 
that the partition-by-document approach was superior [210]. For one 
scenario of a full-text Web search, they estimated the communications 
costs to be about six times higher than the feasible budget. 

However, wanting to exploit prior work on inverted list intersection, 
they studied the partition-by-keyword strategy. They proposed 
several optimizations that put the communication costs for a 
partition-by-keyword system within an order of magnitude of 
feasibility. There had been a couple of prior papers that suggested 
partitioned-by-keyword designs incorporate DHTs to map keywords to 


peers [355, 361]. In Gnawali’s Keyword-set Search System (KSS), the 
index is partitioned by sets of keywords [355]. Terpstra, Behnel, et 
al. point out that by keeping keyword pairs or triples, the number of 
lists per document in KSS is squared or tripled [362]. Shi, 
Guangwen, et al. interpreted the approximations of Li, Loo, et al. to 
mean that neither approach is feasible on its own [21]. Their 
Multi-Level Partitioning (MLP) scheme incorporates both partitioning 
approaches. They arrange nodes into a group hierarchy, with all 
nodes in the single 'level 0” group, and with the same nodes sub- 
divided into k logical subgroups on 'level 1’. The subgroups are 
again divided, level by level, until level 1. The inverted index is 


partitioned by document between groups and by keyword within groups. 
MLP avoids the query flooding normally associated with systems 
partitioned by document, since a small number of nodes in each group 
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process the query. It reduces the bandwidth overheads associated 
with inverted list intersection in systems partitioned solely by 
keyword, since groups can calculate the intersection independently 
over the documents for which they are responsible. MLP was overlaid 
on SkipNet, per Section 3.5.6 [38]. Some initial analyses of 
communications costs and query latencies were provided. 


4.1.3. Partial Search, Exhaustive Search 


Much of the research above addresses partial keyword search. 

Daswani, et al. highlighted the open problem of efficient, 
comprehensive keyword search [25]. How can exhaustive searches be 
achieved without flooding queries to every peer in the network? 
Terpstra, Behnel et al. couched the keyword search problem in 
rendezvous terms: dynamic keyword queries need to 'meet” with static 
document lists [362]. Their Bitzipper scheme is partitioned by 
document. They improved on full flooding by putting document 
metadata on 2sqrt(n) nodes and forwarding queries through only 
6sqrt(n) nodes. They reported that Bitzipper nodes need only 1/166th 
of the bandwidth of full-flooding Gnutella nodes for an exhaustive 
search. An initial comparison of query load was given. There was 
little consideration of either static or dynamic resilience; that is, 
of nodes failing, of documents continually changing, or of nodes 
continually joining and leaving the network. 


4.2. Information Retrieval 


The field of Information Retrieval (IR) has matured considerably 
since its inception in the 1950s [363]. A taxonomy for IR models has 
been formalized [262]. It consists of four elements: a 
representation of documents in a collection; a representation of user 
queries; a framework describing relationships between document 
representations and queries; and a ranking function that quantifies 
an ordering amongst documents for a particular query. Three main 
issues motivate current IR research -- information relevance, query 
response time, and user interaction with IR systems. The dominant IR 
trends for searching large text collections are also threefold [262]. 
The size of collections is increasing dramatically. More complicated 
search mechanisms are being found to exploit document structure, to 
accommodate heterogeneous document collections, and to deal with 
document errors. Compression is in favour -- it may be quicker to 
search compact text or retrieve it from external devices. Ina 
distributed IR system, query processing has four parts. Firstly, 
particular collections are targeted for the search. Secondly, 
queries are sent to the targeted collections. Queries are then 
evaluated at the individual collections. Finally, results from the 
collections are collated. 
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So how do P2P networks differ from distributed IR systems? Bawa, 
Manku, et al. presented four differences [62]. They suggested that a 
P2P network is typically larger, with tens or hundreds of thousands 
of nodes. It is usually more dynamic, with node lifetimes measured 
in hours. They suggested that a P2P network is usually homogeneous, 
with a common resource description language. It lacks the 
centralized "mediators" found in many IR systems that assume 
responsibility for selecting collections, for rewriting queries, and 
for merging ranked results. These distinctions are generally aligned 
with the peer characteristics in Section 1. One might add that P2P 
nodes display more symmetry -- peers are often both information 
consumers and producers. Daswani, Garcia-Molina, et al. pointed out 
that, while there are IR techniques for ranked keyword search at 
moderate scale, research is required so that ranking mechanisms are 
efficient at the larger scale targeted by P2P designs [25]. Joseph 
and Hoshiai surveyed several P2P systems using metadata techniques 
from the IR toolkit [60]. They described an assortment of IR 
techniques and P2P systems, including various metadata formats, 
retrieval models, bloom filters, DHTs, and trust issues. 


In the ensuing paragraphs, we survey P2P work that has incorporated 
information retrieval models, particularly the Vector Model and the 
Latent Semantic Indexing Model. We omit the P2P work based on 
Bayesian models. Some have pointed to such work [60], but made no 
explicit mention of the model [364]. One early paper on P2P 
content-based image retrieval also leveraged the Bayesian model 
[365]. For the former two models, we briefly describe the design, 
then try to highlight robustness aspects. On robustness, we are 
again stymied for lack of prior work. Indeed, a search across all 
proceedings of the Annual ACM Conference on Research and Development 
in Information Retrieval for the words "reliable", "available", 
"dependable", or "adaptable" did not return any results at the time 
of writing. In contrast, a standard text on distributed database 
management systems [366] contains a whole chapter on reliability. IR 
research concentrates on performance measures. Common performance 
measures include recall, the fraction of the relevant documents that 
has been retrieved and precision, the fraction of the retrieved 
documents that is relevant [262]. Ideally, an IR system would have 
high recall and high precision. Unfortunately techniques favouring 
one often disadvantage the other [363]. 
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4.2. 


Ris 


1. Vector Model (PlanetP, FASD, eSearch) 


The vector model [367] represents both documents and queries as term 
vectors, where a term could be a word or a phrase. If a document or 
query has a term, the weight of the corresponding dimension of the 
vector is non-zero. The similarity of the document and query vectors 
gives an indication of how well a document matches a particular 
query. 


The weighting calculation is critical across the retrieval models. 
Amongst the numerous proposals for the probabilistic and vector 

models, there are some commonly recurring weighting factors [363]. 
One is term frequency. The more a term is repeated in a document, 


the more important the term is. Another is inverse document 
frequency. Terms common to many documents give less information 
about the content of a document. Then there is document length. 


Larger documents can bias term frequencies, so weightings are 
sometimes normalized against document length. The expression "TFIDF 
weighting" refers to the collection of weighting calculations that 
incorporate term frequency and inverse document frequency, not just 
to one. Two weighting calculations have been particularly dominant 
-- Okapi [368] and pivoted normalization [369]. A distributed 
version of Google’s Pagerank algorithm has also been devised for a 
P2P environment [370]. It allows incremental, ongoing Pagerank 
calculations while documents are inserted and deleted. 


A couple of early P2P systems leveraged the vector model. Building 
on the vector model, PlanetP divided the ranking problem into two 
steps [215]. In the first, peers are ranked for the probability that 
they have matching documents. In the second, higher-priority peers 
are contacted and the matching documents are ranked. An Inverse Peer 
Frequency, analogous to the Inverse Document Frequency, is used to 
rank relevant peers. To further constrain the query traffic, PlanetP 
contacts only the first group of m peers to retrieve a relevant set 
of documents. In this way, it repeatedly contacts groups of m peers 
until the top k document rankings are stable. While the PlanetP 
designers first quantified recall and precision, they also considered 
reliability. Each PlanetP peer has a global index with a list of all 


other peers, their IP addresses, and their Bloom filters. This large 
volume of shared information needs to be maintained. Klampanos and 
Jose saw this as PlanetP’s primary shortcoming [371]. Each Bloom 


filter summarized the set of terms in the local index of each peer. 
The time to propagate changes, be they new documents or peer 
arrivals/departures, was studied by simulation for up to 1000 peers. 
The reported propagation times were in the hundreds of seconds. 
Design workarounds were required for PlanetP to be viable across 
slower dial-up modem connections. For future work, the authors were 
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considering some sort of hierarchy to scale to larger numbers of 
peers. 


A second early system using the vector model is the Fault-tolerant, 
Adaptive, Scalable Distributed (FASD) search engine [283], which 
extended the Freenet design (Section 2.3) for richer queries. The 
original Freenet design could find a document based on a globally 
unique identifier. Kronfol’s design added the ability to search, for 
example, for documents about "apples AND oranges NOT bananas". It 
uses a TFIDF weighting scheme to build a document’s term vector. 

Each peer calculates the similarity of the query vector and local 
documents and forwards the query to the best downstream peer. Once 
the best downstream peer returns a result, the second-best peer is 
tried, and so on. Simulations with 1000 nodes gave an indication of 
the query path lengths in various situations -- when routing queries 
in a network with constant rates of node and document insertion, when 
bootstrapping the network in a "worst-case" ring topology, or when 
failing randomly and specifically selected peers. Kronfol claimed 
excellent average-case performance -- less than 20 hops to retrieve 
the same top n results as a centralized search engine. There were, 
however, numerous cases where the worst-case path length was several 
hundred hops in a network of only 1000 nodes. 


In parallel, there have been some P2P designs based on the vector 
model from the University of Rochester -- pSearch [9, 372] and 
eSearch [373]. The early pSearch paper suggested a couple of 
retrieval models, one of which was the Vector Space Model, to search 
only the nodes likely to have matching documents. To obtain 
approximate global statistics for the TFIDF calculation, a spanning 
tree was constructed across a subset of the peers. For the m top 
terms, the term-to-document index was inserted into a Content- 
Addressable Network [334]. A variant that mapped terms to document 
clusters was also suggested. eSearch is a hybrid of the partition- 
by-document and partition-by-term approaches (Section 4.1.2) eSearch 
nodes are primarily partitioned by term. Each is responsible for the 
inverted lists for some top terms. For each document in the inverted 
list, the node stores the complete term list. To reduce the size of 
the index, the complete term lists for a document are only kept on 
nodes that are responsible for top terms in the document. eSearch 
uses the Okapi term weighting to select top terms. It relies on the 
Chord DHT [34] to associate terms with nodes storing the inverted 
lists. It also uses automatic query expansion. This takes the 
significant terms from the top document matches and automatically 
adds them to the user’s query to find additional relevant documents. 
The eSearch performance was quantified in terms of search precision, 
the number of retrieved documents, and various load-balancing 
metrics. Compared to the more common proposals for partitioning by 
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keywords, eSearch consumed 6.8 times the storage space to achieve 
faster search times. 


4.2.2. Latent Semantic Indexing (pSearch) 


Another retrieval model used in P2P proposals is Latent Semantic 
Indexing (LSI) [374]. Its key idea is to map both the document and 
query vectors to a concept space with lower dimensions. The starting 
point is a t*N weighting matrix, where t is the total number of 
indexed terms, N is the total number of documents, and the matrix 
elements could be TFIDF rankings. Using singular value 
decomposition, this matrix is reduced to a smaller number of 
dimensions, while retaining the more significant term-to-document 
mappings. Baeza-Yates and Ribeiro-Neto suggested that LSI’s value is 
a novel theoretic framework, but that its practical performance 
advantage for real document collections had yet to be proven [262]. 
pSearch incorporated LSI [9]. By placing the indices for 
semantically similar documents close in the network, Tang, Xu, et al. 
touted significant bandwidth savings relative to the early full- 
flooding variant of Gnutella [372]. They plotted the number of nodes 
visited by a query. They also explored the trade-off with accuracy, 
the percentage match between the documents returned by the 
distributed pSearch algorithm and those from a centralized LSI 
baseline. In a more recent update to the pSearch work, Tang, 
Dwarkadas, et al. summarized LSI’s shortcomings [375]. Firstly, for 
large document collections, its retrieval quality is inherently 
inferior to Okapi. Secondly, singular value decomposition consumes 
excessive memory and computation time. Consequently, the authors 
used Okapi for searching while retaining LSI for indexing. With 
Okapi, they selected the next node to be searched and selected 
documents on searched nodes. With LSI, they ensured that similar 
documents are clustered near each other, thereby optimizing the 
network search costs. When retrieving a small number of top 
documents, the precision of LSI+Okapi approached that of Okapi. 
However, if retrieving a large number of documents, the LSI+Okapi 
precision is inferior. The authors want to improve this in future 
work. 


4.2.3. Small Worlds 


The "small world" concept originally described how people are 
interconnected by short chains of acquaintances [376]. Kleinberg was 
struck by the algorithmic lesson of the small world, namely "that 
individuals using local information are collectively very effective 
at constructing short paths between two points in a social network" 


[377]. Small world networks have a small diameter and a large 
clustering coefficient (a large number of connections amongst 
relevant nodes) [378]. 
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The small world idea has had a limited impact on peer-to-peer 
algorithms. It has influenced only a few unstructured [62, 378-380] 
and structured [344, 381] algorithms. The most promising work on 
"small worlds" in P2P networks are those concerned with the 
information retrieval metrics, precision and recall [62, 378, 380]. 


5. Queries 


Database research suggests directions for P2P research. Hellerstein 
observed that, while work on fast P2P indexes is well underway, P2P 
query optimization remains a promising topic for future research 
[23]. Kossman reviewed the state of the art of distributed query 
processing, highlighting areas for future research: simulation and 
query optimization for networks of tens of thousands of servers and 
millions of clients; non-relational data types (e.g., XML, text, and 
images); and partial query responses since on the Internet, "failure 
is the rule rather than the exception" [19]. A primary motivation 
for the P2P system, PIER, was to scale from the largest database 
systems of a few hundred nodes to an Internet environment in which 
there are over 160 million nodes [22]. Litwin and Sahri have also 
considered ways to combine distributed hashing, more specifically the 
Scalable Distributed Data Structures, with SQL databases, claiming to 
be first to implement scalable distributed database partitioning 
[382]. Motivated by the lack of transparent distribution in current 
distributed databases, they measure query execution times for 
Microsoft SQL servers aggregated by means of an SDDS layer. One of 
their starting assumptions was that it is too challenging to change 
the SQL query optimizer. 


Database research also suggests the approach to P2P research. 
Researchers of database query optimization were divided between those 
looking for optimal solutions in special cases and those using 
heuristics to answer all queries [383]. Gribble, et al. cast query 
optimization in terms of the data placement problem, which is to 
"distribute data and work so the full query workload is answered with 
lowest cost under the existing bandwidth and resource constraints" 
[250]. They pointed out that even the static version of this problem 
is NP-complete in P2P networks. Consequently, research on massive, 
dynamic P2P networks will likely progress using both strategies of 
early database research - heuristics and special-case optimizations. 


If P2P networks are going to be adaptable, if they are to support a 
wide range of applications, then they need to accommodate many query 
types [72]. Up to this point, we have reviewed queries for keys 
(Section 3) and keywords (Sections 4.1. and 4.2). Unfortunately, a 
major shortcoming of the DHTs in Section 3.5 is that they primarily 
support exact-match, single-key queries. Skip Graphs support range 
and prefix queries, but not aggregation queries. Here we probe below 
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the language syntax to identify the open research issues associated 


with more expressive P2P queries [25]. Triantafillou and Pitoura 
observed the disparate P2P designs for different types of queries and 
so outlined a unifying framework [76]. To classify queries, they 


considered the number of relations (single or multiple), the number 
of attributes (single or multiple), and the type of query operator. 
They described numerous operators: equality, range, join, and 
"special functions". The latter referred to aggregation (like sum, 
count, average, minimum, and maximum), grouping and ordering. The 
following sections approximately fit their taxonomy -- range queries, 
multi-attribute queries, join queries and aggregation queries. There 
has been some initial P2P work on other query types -- continuous 
queries [20, 22, 73], recursive queries [22, 74], and adaptive 
queries [23, 75]. For these, we defer to the primary references. 


5.1. Range Queries 


The support of efficient range predicates in P2P networks was 
identified as an important open research issue by Huebsch, et al. 
[22]. Range partitioning has been important in parallel databases to 
improve performance, so that a transaction commonly needs data from 
only one disk or node [22]. One type of range search, longest prefix 
match, is important because of its prevalence in routing schemes for 
voice and data networks alike. In other applications, users may pose 
broad, inexact queries, even though they require only a small number 
of responses. Consequently, techniques to locate similar ranges are 
also important [77]. Various proposals for range searches over P2P 
networks are summarized in Figure 4. Since the Scalable Distributed 
Data Structure (SDDS) has been an important influence on contemporary 
Distributed Hash Tables (DHTs) [49-51], we also include ongoing work 
on SDDS range searches. 


PEER-TO-PEER (P2P) 

Locality Sensitive Hashing (Chord) [77] 
Prefix Hash Trees (unspecified DHT) [78, 79] 
Space Filling Curves (CAN) [80] 

Space Filling Curves (Chord) [81] 

Quadtrees (Chord) [82] 

Skip Graphs [38, 41, 83, 100] 

Mercury [84] 

P-Grid [85, 86] 


SCALABLE DISTRIBUTED DATA STRUCTURES (SDDS) 
RP* [87, 88] 


Figure 4: Solutions for Range Queries on P2P and SDDS Indexes 
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The papers on P2P range search can be divided into those that rely on 
an underlying DHT (the first five entries in Figure 4) and those that 
do not (the subsequent three entries). Bharambe, Agrawal, et al. 
argued that DHTs are inherently ill-suited to range queries [84]. 

The very feature that makes for their good load balancing properties, 
randomized hash functions, works against range queries. One possible 
solution would be to hash ranges, but this can require a priori 
partitioning. If the partitions are too large, partitions risk 
overload. If they are too small, there may be too many hops. 


Despite these potential shortcomings, there have been several range 
query proposals based on DHTs. If hashing ranges to nodes, it is 
entirely possible that overlapping ranges map to different nodes. 
Gupta, Agrawal, et al. rely on locality sensitive hashing to ensure 
that, with high probability, similar ranges are mapped to the same 
node [77]. They propose one particular family of locality sensitive 
hash functions, called min-wise independent permutations. The number 
of partitions per node and the path length were plotted against the 
total numbers of peers in the system. For a network with 1000 nodes, 
the hop count distribution was very similar to that of the exact- 
matching Chord scheme. Was it load-balanced? For the same network 
with 50,000 partitions, there were over two orders of magnitude 
variation in the number of partitions at each node (first and 
ninety-ninth percentiles). The Prefix Hash Tree is a trie in which 
prefixes are hashed onto any DHT. The preliminary analysis suggests 
efficient doubly logarithmic lookup, balanced load, and fault 
resilience [78, 79]. Andrzejak and Xu were perhaps the first to 
propose a mapping from ranges to DHTs [80]. They use one particular 
Space Filling Curve, the Hilbert curve, over a Content Addressable 
Network (CAN) construction (Section 3.5.3). They maintain two 
properties: nearby ranges map to nearby CAN zones; if a range is 
split into two sub-ranges, then the zones of the sub-ranges partition 
the zone of the primary range. They plot path length and load proxy 
measures (the total number of messages and nodes visited) for three 
algorithms to propagate range queries: brute force, controlled 
flooding, and directed controlled flooding. Schmidt and Parashar 
also advocated Space Filling Curves to achieve range queries over a 
DHT [81]. However, they point out that, while Andrzejak and Xu use 
an inverse Space Filling Curve to map a one-dimensional space to d- 
dimensional zones, they map a d-dimensional space back to a one- 
dimensional index. Such a construction gives the ability to search 
across multiple attributes (Section 5.2). Tanin, Harwood, et al. 
suggested quadtrees over Chord [82], and gave preliminary simulation 
results for query response times. 


Because DHTs are naturally constrained to exact-match, single-key 
queries, researchers have considered other P2P indexes for range 
searches. Several were based on Skip Graphs [38, 41], which, unlike 
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the DHTs, do not necessitate randomizing hash functions and are 
therefore capable of range searches. Unfortunately, they are not 
load balanced [83]. For example, in SkipNet [48], hashing was added 
to balance the load -- the Skip Graph could support range searches or 
load balancing, but not both. One solution for load-balancing relies 
on an increased number of 'virtual” servers [168] but, in their 
search for a system that can both search for ranges and balance 
loads, Bharambe, Agrawal, et al. rejected the idea [84]. The virtual 
servers work assumed load imbalance stems from hashing; that is, by 
skewed data insertions and deletions. In some situations, the 
imbalance is triggered by a skewed query load. In such 
circumstances, additional virtual servers can increase the number of 
routing hops and increase the number of pointers that a Skip Graph 
needs to maintain. Ganesan, Bawa, et al. devised an alternate method 
to balance load [83]. They proposed two Skip Graphs, one to index 
the data itself and the other to track load at each node in the 
system. Each node is able to determine the load on its neighbours 
and the most (least) loaded nodes in the system. They devise two 
algorithms: NBRADJUST balances load on neighbouring nodes; using 
REORDER, empty nodes can take over some of the tuples on heavily 
loaded nodes. Their simulations focus on skewed storage load, rather 
than on skewed query loads, but they surmise that the same approach 
could be used for the latter. 


Other proposals for range queries avoid both the DHT and the Skip 
Graph. Bharambe, Agrawal, et al. distinguish their Mercury design by 
its support for multi-attribute range queries and its explicit load 
balancing [84]. In Mercury, nodes are grouped into routing hubs, 
each of which is responsible for various query attributes. While it 
does not use hashing, Mercury is loosely similar to the DHT 
approaches: nodes within hubs are arranged into rings, like Chord 


[34]; for efficient routing within hubs, k long-distance links are 
used, like Symphony [381]. Range lookups require O(((log n)”2)/k) 
hops. Random sampling is used to estimate the average load on nodes 


and to find the parts of the overlay that are lightly loaded. 

Whereas Symphony assumed that nodes are responsible for ranges of 
approximately equal size, Mercury’s random sampling can determine the 
location of the start of the range, even for non-uniform ranges [84]. 
P-Grid [42] does provide for range queries, by virtue of the key 
ordering in its tree structures. Ganesan, Bawa, et al. critiqued its 
capabilities [83]: P-Grid assumes fixed-capacity nodes; there was no 
formal characterization of imbalance ratios or balancing costs; every 
P-Grid periodically contacts other nodes for load information. 


The work on Scalable Distributed Data Structures (SDDSs) has 
progressed in parallel with P2P work and has addressed range queries. 
Like the DHTs above, the early SDDS Linear Hashing (LH*) schemes were 
not order-preserving [52]. To facilitate range queries, Litwin, 
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Niemat, et al. devised a Range Parititioning variant, RP* [87]. 
There are options to dispense with the index, to add indexes to 
clients, and to add them to servers. In the variant without an 
index, every query is issued via multicasting. The other variants 
also use some multicasting. The initial RP* paper suggested 
scalability to thousands of sites, but a more recent RP* simulation 


was capped at 140 servers [88]. In that work, Tsangou, Ndiaye, et 
al. investigated TCP and UDP mechanisms by which servers could return 
range query results to clients. The primary metrics were search and 


response times. Amongst the commercial parallel database management 
systems, they reported that the largest seems only to scale to 32 
servers (SQL Server 2000). For future work, they planned to explore 
aggregation of query results, rather than establishing a connection 
between the client and every single server with a response. 


All in all, it seems there are numerous open research questions on 
P2P range queries. How realistic is the maintenance of global load 
statistics considering the scale and dynamism of P2P networks? 
Simulations at larger scales are required. Proposals should take 
into account both the storage load (insert and delete messages) and 
the query load (lookup messages). Simplifying assumptions need to be 
attacked. For example, how well do the above solutions work in 
networks with heterogeneous nodes, where the maximum message loads 
and index sizes are node-dependent? 


5.2. Multi-Attribute Queries 
There has been some work on multi-attribute P2P queries. As late as 
September 2003, it was suggested that there has not been an efficient 


solution [76]. 


Again, an early significant work on multi-attribute queries over 


aggregated commodity nodes germinated amongst SDDSs. k-RP* [89] uses 
the multi-dimensional binary search tree (or k-d tree, where k 
indicates the number of dimensions of the search index) [384]. It 


builds on the RP* work from the previous section and inherits their 
capabilities for range search and partial match. Like the other 
SDDSs, k-RP* indexes can fit into RAM for very fast lookup. For 
future work, Litwin and Neimat suggested a) a formal analysis of the 
range search termination algorithm and the k-d paging algorithm, b) a 
comparison with other multi-attribute data structures (quad-trees and 
R-trees) and c) exploration of query processing, concurrency control, 
and transaction management for k-RP* files [89]. On the latter 
point, others have considered transactions to be inconsequential to 
the core problem of supporting more complex queries in P2P networks 
[72]. 
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In architecting their secure wide-area Service Discovery Service 
(SDS), Hodes, Czerwinski, et al. considered three possible designs 
for multi-criteria search -- Centralization, Mapping and Flooding 
[90]. These correlate to the index classifications of Section 2 -- 
Central, Distributed, and Local. They discounted the centralized, 
Napster-like index for its risk of a single point of failure. They 
considered the hash-based mappings of Section 3, but concluded that 
it would not be possible to adequately partition data. A document 
satisfying many criteria would be wastefully stored in many 
partitions. They rejected full flooding for its lack of scalability. 
Instead, they devised a query filtering technique, reminiscent of 
Gnutella’s query routing protocol (Section 4.1). Nodes push 
proactive summaries of their data rather than waiting for a query. 
Summaries are aggregated and stored throughout a server hierarchy, to 
guide subsequent queries. Some initial prototype measurements were 
provided for total load on the system, but not for load distribution. 
They put several issues forward for future work. The indexing needs 
to be flexible to change according to query and storage workloads. A 
mesh topology might improve on their hierarchic topology since query 
misses would not propagate to root servers. The choice is analogous 
to BGP meshes and DNS trees. 


More recently, Cai, Frank, et al. devised the Multi-Attribute 
Addressable Network (MAAN) [91]. They built on Chord to provide both 
multi-attribute and range queries, claiming to be the first to 
service both query types in a structured P2P system. Each MAAN node 
has O(log n) neighbours, where N is the number of nodes. MAAN 
multi-attribute range queries require O(log n+N*Smin) hops, where 
Smin is the minimum range selectivity across all attributes. 
Selectivity is the ratio of the query range to the entire identifier 
range. The paper assumed that a locality preserving hash function 
would ensure balanced load. Per Section 5.1, the arguments by 
Bharambe, Agrawal, et al. have highlighted the shortcomings of this 
assumption [84]. MAAN required that the schema must be fixed and 
known in advance -- adaptable schemas were recommended for subsequent 
attention. The authors also acknowledged that there is a selectivity 
breakpoint at which full flooding becomes more efficient than their 
scheme. This begs for a query resolution algorithm that adapts to 
the profile of queries. Cai and Frank followed up with RDFPeers 
[55]. They differentiate their work from other RDF proposals by a) 
guaranteeing to find query results if they exist and b) removing the 
requirement of prior definition of a fixed schema. They hashed 
<subject, predicate, object> triples onto the MAAN and reported 
routing hop metrics for their implementation. Load imbalance across 
nodes was reduced to less than one order of magnitude, but the 
specific measure was the number of triples stored per node - skewed 
query loads were not considered. They plan to improve load balancing 
with the virtual servers of Section 5.1 [168]. 


Risson & Moors Informational [Page 49] 


RFC 4981 Survey of Research on P2P Search September 2007 


5.3. Join Queries 


Two research teams have done some initial work on P2P join 
operations. Harren, Hellerstein, et al. initially described a 
three-layer architecture -- storage, DHT and query processing. They 
implemented the join operation by modifying an existing Content 
Addressable Network (CAN) simulator, reporting "significant hot-spots 
in all dimensions: storage, processing, and routing" [72]. They 
progressed their design more recently in the context of PIER, a 
distributed query engine based on CAN [22, 385]. They implemented 
two equi-join algorithms. In their design, a key is constructed from 
the "namespace" and the "resource ID". There is a namespace for each 
relation and the resource ID is the primary key for base tuples in 
that relation. Queries are multicast to all nodes in the two 
namespaces (relations) to be joined. Their first algorithm is a DHT 
version of the symmetric hash join. Each node in the two namespaces 
finds the relevant tuples and hashes them to a new query namespace. 
The resource ID in the new namespace is the concatenation of join 
attributes. In the second algorithm, called "fetch matches", one of 
the relations is already hashed on the join attributes. Each node in 
the second namespace finds tuples matching the query and retrieves 
the corresponding tuples from the first relation. They leveraged two 
other techniques, namely the symmetric semi-join rewrite and the 
Bloom filter rewrite, to reduce the high bandwidth overheads of the 
symmetric hash join. For an overlay of 10,000 nodes, they simulated 
the delay to retrieve tuples and the aggregate network bandwidth for 
these four schemes. The initial prototype was on a cluster of 64 
PCs, but it has more recently been expanded to PlanetLab. 


Triantafillou and Pitoura considered multicasting to large numbers of 
peers to be inefficient [76]. They therefore allocated a limited 
number of special peers, called range guards. The domain of the join 
attributes was divided, one partition per range guard. Join queries 
were sent only to range guards, where the query was executed. 
Efficient selection of range guards and a quantitive evaluation of 
their proposal were left for future work. 


5.4. Aggregation Queries 


Aggregation queries invariable rely on tree-structures to combine 
results from a large number of nodes. Examples of aggregation 
queries are Count, Sum, Maximum, Minimum, Average, Median, and Top-K 
[92, 386, 387]. Figure 5 summarizes the tree and query 
characteristics that affect dependability. 
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Tree type: Doesn’t use DHT [92], use internal DHT trees [95], use 
independent trees on top of DHTs 

Tree repair: Periodic [93], exceptional [32] 

Tree count: One per key, one per overlay [56] 

Tree flexibility: Static [92], dynamic 


Query interface: install, update, probe [98] 

Query distribution: multicast [98], gossip [92] 

Query applications: leader election, voting, resource location, 
object placement and error recovery [98, 388] 

Query semantics 
Consistency: Best-effort, eventual [92], snapshot / interval / 

single-site validity [99] 

Timeliness [388] 
Lifetime: Continuous [97, 99], single-shot 
No. attributes: Single, multiple 

Query types: Count, sum, maximum, minimum, average, median, top k 
[92, 386, 387] 


Figure 5: Aggregation Trees and Queries in P2P Networks 


Key: Astrolabe [92]; Cone [93]; Distributed Approximative System 
Information Service (DASIS) [95]; Scalable Distributed Information 


Management System (SDIMS) [98]; Self-Organized Metadata Overlay 
(SOMO) [56]; Wildfire [99]; Willow [32]; Newscast [97] 


The fundamental design choices for aggregation trees relate to how 
the overlay uses DHTs, how it repairs itself when there are failures, 
how many aggregation trees there are, and whether the tree is static 
or dynamic (Figure 5). Astrolabe is one of the most influential P2P 
designs included in Figure 5, yet it makes no use of DHTs [92]. 

Other designs make use of the internal trees of Plaxton-like DHTs. 
Others build independent tree structures on top of DHTs. Most of the 
designs repair the aggregation tree with periodic mechanisms similar 
to those used in the DHTs themselves. Willow is an exception [32]. 
It uses a Tree Maintenance Protocol to "zip" disjoint aggregation 
trees together when there are major failures. Yalagandula and Dahlin 
found reconfigurations at the aggregation layer to be costly, 
suggesting more research on techniques to reduce the cost and 
frequency of such reconfigurations [98]. Many of the designs use 
multiple aggregation trees, each rooted at the DHT node responsible 
for the aggregation attribute. On the other hand, the Self-Organized 
Metadata Overlay [56] uses a single tree and is vulnerable to a 
single point of failure at its root. 
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At the time of writing, researchers have just begun exploring the 
performance of queries in the presence of churn. Most designs are 
for best-effort queries. Bawa, et al. devised a better consistency 
model, called Single-Site Validity [99] to qualify the accuracy of 
results when there is churn. Its price was a five-fold increase in 
the message load, when compared to an efficient but best-effort 
Spanning Tree. Gossip mechanisms are resilient to churn, but they 
delay aggregation results and incur high message cost for aggregation 
attributes with small read-to-write ratios. 


6. Security Considerations 


An initial list of references to research on P2P security is given in 
Figure 1, Section 1. This document addresses P2P search. P2P 
storage, security, and applications are recommended for further 
investigation in Section 8. 


7. Conclusions 


Research on peer-to-peer networks can be divided into four categories 
-- search, storage, security and applications. This critical survey 
has focused on search methods. While P2P networks have been 
classified by the existence of an index (structured or unstructured) 
or the location of the index (local, centralized, and distributed), 
this survey has shown that most have evolved to have some structure, 
whether it is indexes at superpeers or indexes defined by DHT 
algorithms. As for location, the distributed index is most common. 
The survey has characterized indexes as semantic and semantic-free. 
It has also critiqued P2P work on major query types. While much of 
it addresses work from 2000 or later, we have traced important 
building blocks from the 1990s. 


The initial motivation in this survey was to answer the question, 
"How robust are P2P search networks?" The question is key to the 
deployment of P2P technology. Balakrishnan, Kaashoek, et al. argued 
that the P2P architecture is appealing: the startup and growth 
barriers are low; they can aggregate enormous storage and processing 


resources; "the decentralized and distributed nature of P2P systems 
gives them the potential to be robust to faults or intentional 
attacks" [18]. If P2P is to be a disruptive technology in 


applications other than casual file sharing, then robustness needs to 
be practically verified [20]. 


The best comparative research on P2P dependability has been done in 
the context of Distributed Hash Tables (DHTs) [291]. The entire body 
of DHT research can be distilled to four main observations about 
dependability (Section 3.2). Firstly, static dependability 
comparisons show that no O(log n) DHT geometry is significantly more 
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dependable than the other O(log n) geometries. Secondly, dynamic 
dependability comparisons show that DHT dependability is sensitive to 


the underlying topology maintenance algorithms (Figure 2). Thirdly, 
most DHTs use O(log n) geometries to suit ephemeral nodes, whereas 
the O(1) hop DHTs suit stable nodes - they deserve more research 


attention. Fourthly, although not yet a mature science, the study of 
DHT dependability is helped by recent simulation tools that support 
multiple DHTs [299]. 


We make the following four suggestions for future P2P research: 


1) Complete the companion P2P surveys for storage, security, and 
applications. A rough outline has been suggested in Figure 1, 
along with references. The need for such surveys was highlighted 
within the peer-to-peer research group of the Internet Research 
Task Force (IRTF) [17]. 


2) P2P indexes are maturing. P2P queries are embryonic. Work on 
more expressive queries over P2P indexes started to gain momentum 
in 2003, but remains fraught with efficiency and load issues. 


3) Isolate the low-level mechanisms affecting robustness. There is 
limited value in comparing robustness of DHT geometries (like 
rings versus de Bruijn graphs), when robustness is highly 
sensitive to underlying topology maintenance algorithms (Figure 
2). 


4) Build consensus on robustness metrics and their acceptable ranges. 
This paper has teased out numerous measures that impinge on 
robustness, for example, the median query path length for a 
failure of x% of nodes, bisection width, path overlap, the number 
of alternatives available for the next hop, lookup latency, 
average live bandwidth (bytes/node/sec), successful routing rates, 
the number of timeouts (caused by a finger pointing to a departed 
node), lookup failure rates (caused by nodes that temporarily 
point to the wrong successor during churn), and clustering 
measures (edge expansion and node expansion). Application-level 
robustness metrics need to drive a consistent assessment of the 
underlying search mechanics. 
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